Simplifying a Cascading pipeline used for aggregating sales data - hadoop

I'm very new to Cascading and Hadoop both, so be gentle... :-D
I think I'm finding myself way over-engineering something. Basically my situation is that I have a pipe-delimited file with 9 fields. I want to compute some aggregated statistics over those 9 fields using different groupings. The result should be 10 fields, of which only 6 are either counts or sums. So far I'm up to 4 Unique pipes, 4 CountBy pipes, 1 SumBy, 1 GroupBy, 1 Every, 2 Each, 5 CoGroups and a couple of others. I need to add another small piece of functionality, and the only way I can see to do it is to add in 2 Filters, 2 more CoGroups and 2 more Each pipes. This all seems like overkill just to compute a few aggregated statistics, so I'm thinking I'm really misunderstanding something.
My input file looks like this:
storeID | invoiceID | groupID | customerID | transaction date | quantity | price | item type | customer type
Item type is either "I", "S" or "G" for inventory, service or group items; customers belong to groups. The rest should be self-explanatory.
The result I want is:
project ID | storeID | year | month | unique invoices | unique groups | unique customers | customer visits | inventory type sales | service type sales |
project ID is a constant; customer visits is the number of days during the month on which the customer came in and bought something
The setup that I'm using right now uses a TextDelimited Tap as my source to read the file and passes the records to an Each pipe which uses a DateParser to parse the transaction date and adds in year, month and day fields. So far so good. This is where it gets out of control.
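In code, the head of the assembly looks roughly like this (a trimmed sketch from inside my flow assembly; the exact field names, delimiter handling, date format and paths are approximations rather than my real code):

import cascading.operation.text.DateFormatter;
import cascading.operation.text.DateParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

Fields salesFields = new Fields("storeID", "invoiceID", "groupID", "customerID",
    "transactionDate", "quantity", "price", "itemType", "customerType");
Tap source = new Hfs(new TextDelimited(salesFields, "|"), "input/sales.txt");

Pipe sales = new Pipe("sales");
// DateParser turns the text date into a long timestamp; the two DateFormatter
// steps then derive the year and month fields used by the groupings below
sales = new Each(sales, new Fields("transactionDate"),
    new DateParser(new Fields("ts"), "yyyy-MM-dd"), Fields.ALL);
sales = new Each(sales, new Fields("ts"),
    new DateFormatter(new Fields("year"), "yyyy"), Fields.ALL);
sales = new Each(sales, new Fields("ts"),
    new DateFormatter(new Fields("month"), "MM"), Fields.ALL);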
I'm splitting the stream from there up into 5 separate streams to process each of the aggregated fields that I want. Then I'm joining all the results together in 5 CoGroup pipes, sending the result through Insert (to insert the project ID) and writing through a TextDelimited sink Tap.
Is there an easier way than splitting into 5 streams like that? The first four streams do almost exactly the same thing, just on different fields. For example, the first stream uses a Unique pipe to get unique invoiceIDs, then uses a CountBy to count the number of records with the same storeID, year and month. That gives me the number of unique invoices created for each store by year and month. Then there is a stream that does the same thing with groupID, and another that does it with customerID.
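The invoice branch, continuing from the sales pipe in the sketch above, looks roughly like this (the group and customer branches are identical apart from the key field; again field names are approximations):

import cascading.pipe.assembly.CountBy;
import cascading.pipe.assembly.Unique;

// keep one row per (storeID, year, month, invoiceID), then count rows per
// (storeID, year, month) to get the number of distinct invoices per store/month
Pipe invoices = new Pipe("invoices", sales);
invoices = new Unique(invoices, new Fields("storeID", "year", "month", "invoiceID"));
invoices = new CountBy(invoices, new Fields("storeID", "year", "month"),
    new Fields("uniqueInvoices"));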
Any ideas for simplifying this? There must be an easier way.

Related

I would like to create an efficient Bigtable row key

I would like to create an optimal row key in Bigtable. I have a table channel_data with 3 columns: channel_id, date, fan_count.
channel_id | date | fan_count
1 | 2022-03-01 | 5000
1 | 2022-03-02 | 6000
2 | 2022-03-01 | 200
2 | 2022-03-02 | 300
3 | 2022-03-03 | 1000
Users of our application can set up brands/buckets by adding multiple channels. Users can choose any random channel_id.
I want to design an efficient row key to fetch aggregated fan_count in a date range for a brand.
Let's say the user creates a brand with channel_id 1 and 3 and wishes to see the sum of all fans for the time period 2022-03-01 to 2022-03-03.
The result should be 5000+6000+1000=12000
You have a few options here. Because you're looking to do queries based on date, you should probably make that the end part of your rowkey so you can scope down by brand first. You could also use timestamped cells to store multiple values for each channel. Perhaps a week or month of data, so it is grouped together in that way, but this isn't necessary.
Perhaps a rowkey like channel_id/yyyy-mm-dd is what you'd want. You can choose to store the date and channel info in the table, but it isn't necessary since you'd have it in your ids. You can just treat Bigtable like a key/value store in this instance, which might be more efficient depending on your scenario.
If you choose to store a month of data per row, you would make the rowkey something like channel_id/yyyy-mm and timestamp each value for the day.
Either way, for your queries, if you need multiple channels you could do multiple reads or a multi-prefix scan. Let me know if this helps clarify the schema design and if you have more questions.
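For example, with the daily-row layout (channel_id/yyyy-mm-dd) and the Java client, summing a brand's fan_count over a date range could look roughly like this. The project, instance, column family ("stats") and the assumption that fan_count is stored as a UTF-8 string are placeholders for illustration, not something your schema dictates:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Query;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowCell;
import java.util.List;

public class BrandFanCount {
    public static void main(String[] args) throws Exception {
        try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
            long total = 0;
            // brand = channels 1 and 3; one bounded range scan per channel
            for (String channelId : new String[] {"1", "3"}) {
                Query query = Query.create("channel_data")
                    .range(channelId + "/2022-03-01",          // start key, inclusive
                           channelId + "/2022-03-03" + "\0");  // end key, nudged past the last day
                for (Row row : client.readRows(query)) {
                    List<RowCell> cells = row.getCells("stats", "fan_count");
                    if (!cells.isEmpty()) {
                        total += Long.parseLong(cells.get(0).getValue().toStringUtf8());
                    }
                }
            }
            System.out.println(total); // 5000 + 6000 + 1000 = 12000
        }
    }
}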

Perform an OR operation on same field from multiple rows SSRS

I am a beginner trying to achieve a simple operation in SSRS using Visual Studio 2019. I have a query which returns a table as follows:
ID | Name | Married
1 | Jack | Y
2 | Jack | N
The number of records might vary depending on the number of results. On the report, I want to display only the field 'Married' once. The value of the field will be determined using an OR operation, i.e. if the field 'Married' is 'Y' for any one record, I want to display a 'Y' on the report.
Assuming the values are either Y or N, you should be able to use something like
=MAX(Fields!Married.Value)
If your report is grouped by, for example, Name, then this will give you the MAX value within each group, which is probably what you want.
If this does not help, edit your question and show
Your report design
Row Group panel plus details of grouping
A larger sample of data
Expected results from that sample data

Creating advanced SUMIF() calculations in Quicksight

I have a couple of joined Athena tables in Quicksight. The data looks something like this:
Ans_Count | ID | Alias
10 | 1 | A
10 | 1 | B
10 | 1 | C
20 | 2 | D
20 | 2 | E
20 | 2 | F
I want to create a calculated field that sums the Ans_Count column based on distinct IDs only, i.e. in the example above the result should be 30.
How do I do that?? Thanks!
Are you looking for the sum before or after applying a filter?
Sumif(Ans_Count,ID) may be what you're looking for.
If you need to always return the result of the sum, regardless of the filter on the visual, look at the sumOver() function.
You can use distinctCountOver at PRE_AGG level to count unique number of values for a given partition. You could use that count to drive the sumIf condition as well.
Example : distinctCountOver(operand, [partition fields], PRE_AGG)
More details about what the visual's group-by specification will be, and an example where there are duplicate IDs, would help give a specific solution.
It might even be as simple as minOver(Ans_Count, [ID], PRE_AGG) and using SUM aggregation on top of it in the visual.
If you want another column with the values repeated, use sumOver(Ans_Count, [ID], PRE_AGG). Or, if you want to aggregate via QuickSight, you would use sumOver(sum(Ans_Count), [ID]).
I agree with the above suggestions to use sumOver(sum(Ans_Count), [ID]).
I have yet to understand the use cases for pre_agg, so if anyone has concrete examples please share them!
Another suggestion would be to do a sumOver + partition by in your table (if possible) before uploading the dataset, then checking whether the results match Quicksight's aggregations. I find Quicksight can be tricky with calculated fields, aggregations, and nested ifs, so I've been doing calculations in SQL where possible before bringing the data into Quicksight, to have a better grasp of what the outputs should look like. This is obviously an extra step, but it can help in understanding how Quicksight pulls off calcs and comes up with figures (as the documentation doesn't always give much), and in spotting things that don't look right (I've had a few) before you share your analysis with a wider group.

Increase scan performance in Apache Hbase

I am working on a use case and need help improving the scan performance.
Customer visits to our website are generated as logs, which we process with Apache Pig and insert directly into an HBase table (test) using HBaseStorage. This is done every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family)
As of now I generate a random number for each row and insert it as the row key for that table. For example, I have the following data to be inserted into the table:
1725|xxx|www.something.com|127987834 | india |zzzz
1726|yyy|www.some.com|128389478 | UK | yyyy
Here I would add 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same id will be repeated on different days, so I chose a random number as the row key.
When querying data from the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1-2 seconds, since I am using it for real-time analytics.
Thanks
As per the query you mentioned, I am assuming you need records based on Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.
However, there could be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. This unique number could be the timestamp; it depends on your requirements.
To scan the data in this case, you can then use a PrefixFilter on the row key. This will give you much better performance.
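For example, with row keys like CustomerID|timestamp, the scan for customer 1002 through the HBase Java client would look roughly like this (connection setup and output handling are only placeholders):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("test"))) {
            // row keys are "<CustomerID>|<timestamp>", so a prefix scan reads only
            // that customer's rows instead of filtering the entire table
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes("1002|"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}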
Hope this helps.

Order By any field in Cassandra

I am researching Cassandra as a possible solution for my upcoming project. The more I research, the more I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
Is it possible to sort on any field? If there is a performance impact for sorting on fields not in the clustering key, what is that performance impact? I need to sort around 2 million records in the table.
I keep hearing that it is a bad idea to sort on fields that were not set up for sorting when the table was created.
It's not so much that it's a bad idea. It's just really not possible to make Cassandra sort your data by an arbitrary column. Cassandra requires a query-based modeling approach, and that goes for sort order as well. You have to decide ahead of time the kinds of queries you want Cassandra to support, and the order in which those queries return their data.
Is it possible to sort on any field?
Here's the thing with how Cassandra sorts result sets: it doesn't. Cassandra queries correspond to partition locations, and the data is read off the disk and returned to you. If the data is read in the same order that it was sorted in on-disk, the result set will be sorted. On the other hand, if you try a multi-key query or an index-based query where it has to jump around to different partitions, chances are that it will not be returned in any meaningful order.
But if you plan ahead, you can actually influence the on-disk sort order of your data, and then leverage that order in your queries. This can be done with a modeling mechanism called a "clustering column." Cassandra will allow you to specify multiple clustering columns, but they are only valid within a single partition.
So what does that mean? Take this example from the DataStax documentation.
CREATE TABLE playlists (
  id uuid,
  artist text,
  album text,
  title text,
  song_order int,
  song_id uuid,
  PRIMARY KEY ((id), song_order))
WITH CLUSTERING ORDER BY (song_order ASC);
With this table definition, I can query a particular playlist by id (the partition key). Within each id, the data will be returned ordered by song_order:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
In this example, I only need to specify an ORDER BY if I want to switch the sort direction. As the rows are stored in ASCending order, I need to specify DESC to see them in DESCending order. If I were fine with getting the rows back in ASCending order, I wouldn't need to specify ORDER BY at all.
But what if I want to order by artist? Or album? Or both? Since one artist can have many albums (for this example), we'll modify the PRIMARY KEY definition like this:
PRIMARY KEY ((id),artist,album,song_order)
Running the same query above (minus the ORDER BY) produces this output:
SELECT id, song_order, album, artist, title
FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
id | song_order | album | artist | title
------------------------------------------------------------------------------------------------------------------
62c36092-82a1-3a00-93d1-46196ee77204 | 3 | Roll Away | Back Door Slam | Outside Woman Blues
62c36092-82a1-3a00-93d1-46196ee77204 | 4 | No One Rides For Free | Fu Manchu | Ojo Rojo
62c36092-82a1-3a00-93d1-46196ee77204 | 2 | We Must Obey | Fu Manchu | Moving in Stereo
62c36092-82a1-3a00-93d1-46196ee77204 | 1 | Tres Hombres | ZZ Top | La Grange
Notice that the rows are now ordered by artist, and then album. If we had two songs from the same album, then song_order would be next.
So now you might ask "what if I just want to sort by album, and not artist?" You can sort just by album, but not with this table. You cannot skip clustering keys in your ORDER BY clause. In order to sort only by album (and not artist) you'll need to design a different query table. Sometimes Cassandra data modeling will have you duplicating your data a few times, to be able to serve different queries...and that's ok.
For more detail on how to build data models while leveraging clustering order, check out these two articles on PlanetCassandra:
Getting Started With Time Series Data Modeling - Patrick McFadin
We Shall Have Order! - Disclaimer - I am the author
