Calculating statistics directly from a CSV file - bash

I have a transaction log file in CSV format that I want to use to run statistics. The log has the following fields:
date: Time/date stamp
salesperson: The username of the person who closed the sale
promo: Sum total of items in the sale that were promotions
amount: Grand total of the sale
I'd like to get the following statistics:
salesperson: The username of the salesperson being analyzed
minAmount: The smallest grand total of this salesperson's transactions
avgAmount: The mean grand total of this salesperson's transactions
maxAmount: The largest grand total of this salesperson's transactions
minPromo: The smallest promo amount by the salesperson
avgPromo: The mean promo amount by the salesperson
I'm tempted to build a database structure, import this file, write SQL, and pull out the stats. I don't need anything more from this data than these stats. Is there an easier way? I'm hoping some bash script could make this easy.

TxtSushi does this:
tssql -table trans transactions.csv \
'select
    salesperson,
    min(as_real(amount)) as minAmount,
    avg(as_real(amount)) as avgAmount,
    max(as_real(amount)) as maxAmount,
    min(as_real(promo)) as minPromo,
    avg(as_real(promo)) as avgPromo
from trans
group by salesperson'
I have a bunch of example scripts showing how to use it.
Edit: fixed syntax

Could also bang out an awk script to do it. It's only CSV with a few variables.
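For instance, a rough awk sketch (assuming a header row, the column order from the question - date, salesperson, promo, amount - and no quoted fields or embedded commas):

awk -F, 'NR > 1 {
    sp = $2; promo = $3 + 0; amt = $4 + 0
    # first time we see this salesperson, seed the min/max values
    if (!(sp in count)) {
        minAmt[sp] = amt; maxAmt[sp] = amt; minPromo[sp] = promo
    }
    if (amt < minAmt[sp])     minAmt[sp] = amt
    if (amt > maxAmt[sp])     maxAmt[sp] = amt
    if (promo < minPromo[sp]) minPromo[sp] = promo
    sumAmt[sp] += amt; sumPromo[sp] += promo; count[sp]++
}
END {
    print "salesperson,minAmount,avgAmount,maxAmount,minPromo,avgPromo"
    for (sp in count) {
        printf "%s,%s,%s,%s,%s,%s\n", sp, minAmt[sp], sumAmt[sp]/count[sp],
               maxAmt[sp], minPromo[sp], sumPromo[sp]/count[sp]
    }
}' transactions.csv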

You can loop through the lines in the CSV and use bash script variables to hold your min/max amounts. For the average, just keep a running total and then divide by the total number of lines (not counting a possible header).
Here are some useful snippets for working with CSV files in bash.
If your data might be quoted (e.g. because a field contains a comma), processing with bash, sed, etc. becomes much more complex.
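For completeness, here is a rough sketch of that loop for the amount column (assuming bash 4+ for associative arrays, bc for the decimal math, a header row, unquoted fields, and the column order date, salesperson, promo, amount; the promo columns would follow the same pattern):

#!/usr/bin/env bash
declare -A count sum_amt min_amt max_amt

{
    read -r _header                     # skip the header line
    while IFS=, read -r _date sp _promo amt; do
        (( count[$sp]++ ))
        sum_amt[$sp]=$(bc -l <<< "${sum_amt[$sp]:-0} + $amt")
        # bc handles the decimal comparisons that bash integer math cannot
        if [[ -z ${min_amt[$sp]:-} ]] || (( $(bc -l <<< "$amt < ${min_amt[$sp]}") )); then
            min_amt[$sp]=$amt
        fi
        if [[ -z ${max_amt[$sp]:-} ]] || (( $(bc -l <<< "$amt > ${max_amt[$sp]}") )); then
            max_amt[$sp]=$amt
        fi
    done
} < transactions.csv

echo "salesperson,minAmount,avgAmount,maxAmount"
for sp in "${!count[@]}"; do
    avg=$(bc -l <<< "${sum_amt[$sp]} / ${count[$sp]}")
    printf '%s,%s,%s,%s\n' "$sp" "${min_amt[$sp]}" "$avg" "${max_amt[$sp]}"
done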

Related

Compare Dynamic Lists Power BI

I have a table ("Issues") which I am creating in PowerBI from a JIRA data connector, so this changes each time I refresh it. I have three columns I am using
Form Name
Effort
Status
I created a second table and have summarized the Form Names and obtained the Total Effort:
SUMMARIZE(Issues,Issues[Form Name],"Total Effort",SUM(Issues[Effort (Days)]))
But I also want to add in a column for
Total Effort for each form name where the Status field is "Done"
My issue is that I don't know how to compare both tables / form names since these might change each time I refresh the table.
I need to write a conditional, something like:
For each form name, print the total effort, and the total effort where the status is done.
I have tried SUMX, CALCULATE, SUM, FILTER but cannot get these to work - can someone help, please?
If all you need is to add a column to your summarized table that sums "Effort" only when the Status is set to 'Done' -- then this is the right place to use CALCULATE.
Table =
SUMMARIZE(
    Issues,
    Issues[Form Name],
    "Total Effort", SUM(Issues[Effort]),
    "Total Effort (Done)", CALCULATE(SUM(Issues[Effort]), Issues[Status] = "Done")
)
Here is a quick capture of some of the mock data that I used to test this. The Matrix is just the mock data with [Form Name] on the rows and [Status] on the columns. The last table shows the 'summarized' data calculated by the DAX above. You can compare this to the values in the matrix and see that they tie out.

How to make groups in an input and select a specific row in each of them in Talend?

I am working on a Talend transformation process (we are using Talend 6.4), and I don't know how to implement the current requirement.
I have an input consisting of:
Two columns that are my group keys (Account and Product), but are not unique (the same Account x Product pair can appear in multiple rows)
A criterion column (Contract end date), which will help me decide which row I want to keep for each group
Some "tail" data that needs to be passed to the following step of the processing (the contract number)
The rule to implement is:
Keep only one record per group
The selected record must be one with no end date or, if all records have an end date, the one with the latest end date
The selected record can be random in case there is a tie
See the transformation applying those rules on some dummy data:
I thought first to do the following:
sort by Account, Product, End_date (nulls first)
"select first" in each group
but I am not skilled enough to know whether the second transformation exists in Talend.
Regards,
Pierre
Very interesting Talend question.
You need to create something like this job.
Here is a link to the zip file to import into your Talend.
The answer from #MBDIA seems to be working; however, I would like to share what we did to fulfill our requirement.
See our Talend process here:
The first tMap (tMap_3) acts like a tReplicate and a tMap, and sends:
in the upper branch, only the Account and Product references, which are then deduplicated by tAggregateRow_1.
in the lower branch, all data plus computed fields that let us handle the case where the date is missing (instead of defaulting to 31/12/9999, we compute a flag (0 or 1) that we use in the sort step afterwards).
In the second part of the process, we first sort the whole data on Account, Product, the empty-date flag (computed before), and End date (descending), and then use a second tMap to join both branches (on Account x Product), keeping only the First Match so that the first record per group is retained, as per our requirement.

SORT in JCL based on Current Date

Requirement: I need to sort an input file based on Date.
The date is in YYYYMMDD format starting at 56th Position in the flat file.
Now I am trying to write a sort card which writes all the records that have a date (YYYYMMDD) in the past 7 days.
Example: if my job is running on 20181007, it should fetch all the records that have a date between 20181001 and 20181007.
Thanks in advance.
In terms of DFSORT, you can use the following filter, which compares the field against the current date as a relative value. For instance:
OUTFIL INCLUDE=(56,8,CH,GE,DATE1-7)
There are several definitions for Dates in various formats. I assume that since you are referring to a flat file the date is in a character format and not zoned decimal or other representation.
For DFSORT, here is a reference to the INCLUDE statement.
Similar constructs exist for other sort products. Without specifics about the product you're using, this is unfortunately a generic answer.
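For context, here is a rough sketch of a complete DFSORT step wrapping that filter (dataset names are placeholders); the SORT statement orders the output on the same date field, as the question asks:

//SORTSTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=YOUR.INPUT.FILE,DISP=SHR
//SORTOUT  DD DSN=YOUR.OUTPUT.FILE,
//            DISP=(NEW,CATLG,DELETE),
//            LIKE=YOUR.INPUT.FILE
//SYSIN    DD *
* KEEP ONLY DATES >= TODAY MINUS 7 DAYS, SORTED ON THE DATE FIELD
  SORT FIELDS=(56,8,CH,A)
  OUTFIL INCLUDE=(56,8,CH,GE,DATE1-7)
/*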

Speed up data comparison in powershell acquired via Import-CSV

Simple question, but tough problem.
I have 2 CSV files exported from Excel, one with 65k rows and one with about 50k. I need to merge the data from those 2 files based on this condition:
where File1.Username -eq File2.Username
Note that the datatype for the username property in both files is this :
IsPublic IsSerial Name     BaseType
-------- -------- ----     --------
True     True     String   System.Object
And obviously looping through 65k x 50k objects' properties to compare takes... well, 1 day and 23 hours, as I estimated when I measured a script run on only 10 rows.
I am considering several solutions at this point, like splitting the CSV files and running different portions in different PowerShell sessions simultaneously while giving real-time priority to powershell.exe, but that's cumbersome, and I haven't tested that option, so I can't report on the real performance gain.
I wondered if I should rather change the datatype and use, for instance, .ToString().GetHashCode(), but I tried that option too and, oddly enough, the execution time was quicker when comparing string vs string than hash sum integer vs hash sum integer.
So long story short, I am looking for a superfast way to compare 65k x 50k string variables.
Any help would be greatly appreciated :)
Thanks!
Elaborating example:
Ok here's a metaphorical example. Suppose you have a database containing the names and equipment of astronauts (SPACE), and another one containing the names and equipment of marine explorers (OCEAN).
So in the SPACE dataset you'll have for instance:
First Name,Last name, Username, space gear,environment.
And then the first row of data would look like:
Neil,Armstrong,Stretch,spacesuit,moon
In the OCEAN Dataset you'd have :
First Name,Last name, Username, birthdate, diving gear,environment
with the following data:
Jacques,Cousteau,Jyc,1910-06-11,diving suit,ocean
Now suppose that at some point Neil Armstrong had himself registered for a diving course and was added to the OCEAN dataset.
In the OCEAN Dataset you'd now have :
First Name,Last name, Username, birthdate, diving gear,environment
with the following data:
Jacques,Cousteau,Jyc,1910-06-11,diving suit,ocean
Neil,Armstrong,Stretch,1930-08-05,diving suit,ocean
The person who handed the data over to me gave me a third dataset which was a "mix" of the other 2:
In the MIXED Dataset you'd now have :
Dataset,First Name,Last name, Username, birthdate, diving gear, space gear,environment
with the following data:
ocean,Jacques,Cousteau,Jyc,1910-06-11,diving suit,,ocean
space,Neil,Armstrong,Stretch,1930-08-05,,space suit,moon
ocean,Neil,Armstrong,Stretch,1930-08-05,diving suit,,ocean
So my task is to make the dataset MIXED looking like this:
First Name,Last name, Username, birthdate, diving gear, space gear,environment
Jacques,Cousteau,Jyc,1910-06-11,diving suit,,ocean
Neil,Armstrong,Stretch,1930-08-05,diving suit,space suit,(moon,ocean)
And to top it all off, there's a couple of profoundly stupid scenarios that can happen:
1) The same guy could be in either the SPACE or the OCEAN dataset more than once, but with different usernames.
2) Two completely different users could share the same username in the SPACE dataset, but NOT in the OCEAN dataset, where usernames are unique. Yes, you read that correctly: both Cousteau and Armstrong could potentially have the same username.
I've indeed already looked at the possibility of having the data cleaned up a little bit before getting my teeth stuck in that task, but that's not possible.
I have to take the context as it is, can't change anything.
So the first thing I did was to group the records by the username field (Group-Object -Property Username) and look at the record counts, and my work was focused on the cases where a given user was, like Neil Armstrong, in both datasets.
When there is only 1 record, like Cousteau, it's straightforward: I leave it as it is. When there's one record in each dataset, I need to merge data, and when there are more than 2 records for one username, it's fair to say that it is a complete mess, although I don't mind leaving those as they are just now (especially because thousands of records have [string]::IsNullOrEmpty($Username) = $true, so they count as more than 2 records).
I hope it makes more sense?
At the moment I want to focus on the cases where a given username shows up once in both the SPACE and OCEAN datasets. I know it's not complicated, but the algorithm I am using (sketched below) makes the whole process super slow:
0 - Create an empty array
1 - Get rows from SPACE dataset
2 - Get rows from OCEAN dataset
3 - Create a hashtable containing the properties of both datasets where properties aren't empty
4 - Create a psobject to encapsulate the hashtable
5 - Add that object to the array
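Here is a rough sketch of those steps in PowerShell (file paths are placeholders):

# 0/1/2 - empty result array plus the rows from both datasets
$space  = Import-Csv 'C:\data\SPACE.csv'
$ocean  = Import-Csv 'C:\data\OCEAN.csv'
$merged = @()

foreach ($s in $space) {
    # every SPACE row triggers a scan of all OCEAN rows - this nested
    # comparison is what makes the run so slow
    $o = $ocean | Where-Object { $_.Username -eq $s.Username }

    # 3 - hashtable containing the non-empty properties of both rows
    $props = @{}
    foreach ($row in @($s) + @($o)) {
        if ($null -eq $row) { continue }
        foreach ($p in $row.PSObject.Properties) {
            if (-not [string]::IsNullOrEmpty($p.Value)) { $props[$p.Name] = $p.Value }
        }
    }

    # 4/5 - wrap the hashtable in a psobject and add it to the array
    $merged += [pscustomobject]$props
}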
And that is taking ages because I am talking about 65k records in SPACE and about 50k records in OCEAN.
So I was wondering if there's a better way of doing this?
Thanks !

Oracle spool Number rounding

I am calculating the sum of all sales orders (by multiplying the quantity and price of each sales order - assume one sales order has only one item - and using the SUM function) in a SQL query, and I am spooling the output to a CSV file using spool C:\scripts\output.csv.
The numeric output I get is truncated/rounded; e.g. the SQL output 122393446 shows up in the CSV as 122400000.
I tried to google and search on stackoverflow, but I could not get any hints about what can be done to prevent this.
Any clues?
Thanks
I think it is an xls issue.
Save as xls.
Then format the column -> Number, with 2 decimals for example.
Initially I thought it might have something to do with the width of the number format, which is normally 10 (NUMWIDTH) in SQL*Plus, but your resulting numeric width is 9, so that cannot be the problem. Please check whether your query uses a numeric type that doesn't have the required precision and thus makes inexact calculations.
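If you want to rule out the SQL*Plus display side before blaming the spreadsheet, here is a rough sketch (table and column names are hypothetical) that widens the numeric display and forces an explicit format before spooling:

-- Widen the default numeric display width and force an explicit picture,
-- so SQL*Plus display formatting can be ruled out as the cause.
SET NUMWIDTH 20
COLUMN total_sales FORMAT 999999999999990

SPOOL C:\scripts\output.csv
SELECT SUM(quantity * price) AS total_sales
  FROM sales_orders;
SPOOL OFF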
