I am totally new to data structures and algorithms. While I am still learning and trying to pick up the various functions, I have been asked to display 5 sets of data, using suitable data structures, based on sample data in CSV files.
Each file contains a year’s worth of data for multiple sensors. Data for each date-time recording are on separate rows. Within each row, the value for each sensor is separated by a comma. There are a total of 105,120 rows per year/file. Currently the client has 10 years of data, which is just over a million records.
I am supposed to find out:
The maximum wind speed of a specified month and year.
The median wind speed of a specified year.
Average wind speed for each month of a specified year. Display the data in the order of month (Jan, Feb, Mar, ...)
Total solar radiation for each month of a specified year. Display the data in a descending order of the solar radiation (i.e. month with the highest total solar radiation will display first).
Given a date, show the times for the highest solar radiation for that date. There can be one or more time values with the same highest solar radiation. Display the list of times in descending order (e.g. 24:00, 23:00, 22:00, etc.)
As I am new to data structures, I have been struggling to decide which algorithms to propose for the above.
I am wondering whether I can use:
BST (Binary Search Tree) to solve Qn 1
Linear for Qn 2
Constant to sort and linear to find the average for Qn 3
Linear for both Qn 4 and Qn 5.
Does anyone have a better suggestion or sample pseudocode to share on this? Or how should I start?
Regards, Heaptie
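One possible starting point, sketched in Python. The record layout `(datetime, wind_speed, solar_radiation)` and all function names are assumptions for illustration, since the question does not show the CSV header. Each query is a single pass over the records plus, where ordering is needed, a sort of the (small) result.

```python
from collections import defaultdict
from statistics import median

# Assumed record layout: (datetime, wind_speed, solar_radiation).

def max_wind(records, year, month):
    """Q1: maximum wind speed in the given month and year."""
    return max(w for t, w, _ in records if t.year == year and t.month == month)

def median_wind(records, year):
    """Q2: median wind speed in the given year (sorts the year's values)."""
    return median(w for t, w, _ in records if t.year == year)

def monthly_avg_wind(records, year):
    """Q3: (month, average wind speed) pairs in calendar order."""
    sums = defaultdict(lambda: [0.0, 0])  # month -> [running sum, count]
    for t, w, _ in records:
        if t.year == year:
            sums[t.month][0] += w
            sums[t.month][1] += 1
    return [(m, s / n) for m, (s, n) in sorted(sums.items())]

def monthly_solar_desc(records, year):
    """Q4: (month, total solar radiation) pairs, highest total first."""
    totals = defaultdict(float)
    for t, _, s in records:
        if t.year == year:
            totals[t.month] += s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def peak_solar_times(records, day):
    """Q5: times on `day` that share the highest solar radiation, latest first."""
    todays = [(t, s) for t, _, s in records if t.date() == day]
    peak = max(s for _, s in todays)
    return sorted((t.time() for t, s in todays if s == peak), reverse=True)
```

With roughly a million records, a linear scan per query is usually fast enough. The median is the only query here that needs the values sorted, so it costs O(n log n) for the chosen year rather than O(n).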
I've been Googling around this problem for hours and haven't found a solution that suits my needs.
I have a large data set with agent activities and the total time in seconds each activity lasts. I'm pulling this together in a matrix, to display agent names on the left and the start date of each week across the top like so:
This is working as intended (I've used a measure to convert the seconds into hours) but I need the average of the displayed weeks as another column, or to replace the Total column.
I've tried solutions involving DAX measures but none are applicable, likely because I'm using a custom column (WeekStart) to roll up my numbers into weeks. Adding to the complexity, I have 2 filters on the matrix: one to exclude any weeks older than 5 weeks in the past and another to exclude any future weeks.
In Excel I'd just add another column next to the table, averaging the 5 cells to the left of it. I could add it to the data table with a SUMIFS checking the Activity date is within the week range and dividing the result by 5. I can't do either of these in PowerBI and I'm new to the software so I'm at a loss as to how to do this.
This is my input file, having months of data.
Now I want to create this matrix in the PBIX, with 12 months of trailing data from a date slicer.
That is I will have a date slicer and I want to see the previous 12 months of data from the selected month.
For example, if I select Jan'21 in my slicer I want to have data from Jan'21 to Feb'20.
I watched some videos, but they focused on measures, rolling averages or rolling totals; in my case I want columns.
I have no idea how to implement this in Power BI. Thanks.
EDIT:- The Actual Exp and Actual Min column also have N/A for various months.
I am in the process of designing an algorithm that will calculate regions in a candlestick chart where strong areas of support exist. An "area of support" in this case is defined as an area in the chart where the price of a stock rises by a large amount in a short period of time. (Please see the diagram below, the blue dots represent these strong areas of support)
The data I am working with is a list of over 6000 TOHLC (timestamp, open price, high price, low price, close price) values. For example, the first entry in this list of data is:
[1555286400, 83.7, 84.63, 83.7, 84.27]
The way I have structured the algorithm to work is as follows:
1.) The list of 6000+ TOHLC values are split into sub-lists of 30 TOHLC values (30 is a number that I arbitrarily chose). The lowest low price (LLP) is then obtained from each of these sub-lists. The purpose behind using this method is to find areas in the chart where prices dip.
2.) The next step is to determine how high the price rose from each of these lows. For this, I take the next 30 candlestick values from the low and determine what the highest high price (HHP) is. Then, if HHP / LLP >= 1.03, the low price is accepted, otherwise it is discarded. Again, 1.03 is a value that I arbitrarily chose, by analysing the stock chart manually and determining how much the price rose on average from these lows.
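The two steps above could be sketched roughly as follows in Python. This is one reading of the description, not the author's actual code: the function name `find_support` is hypothetical, and it assumes the "next 30 candlesticks" are counted from the position of the low in the full list.

```python
def find_support(candles, window=30, min_rise=1.03):
    """candles: list of [timestamp, open, high, low, close] entries."""
    accepted = []
    for start in range(0, len(candles), window):
        chunk = candles[start:start + window]
        # Step 1: index (in the full list) of the lowest low in this sub-list.
        low_idx = start + min(range(len(chunk)), key=lambda i: chunk[i][3])
        llp = candles[low_idx][3]
        # Step 2: highest high among the candles that follow the low.
        following = candles[low_idx + 1:low_idx + 1 + window]
        if not following:
            continue  # the low is the last candle, nothing to compare against
        hhp = max(c[2] for c in following)
        if hhp / llp >= min_rise:
            accepted.append(candles[low_idx])
    return accepted
```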
The blue dots in the chart above represent the areas of support accepted by the algorithm. It appears to be working well in terms of what I am trying to achieve.
So the question I have is: does anyone have any improvements they can suggest for this algorithm, or point out any faults in it?
Thanks!
I may have misunderstood, but from your explanation it seems that you are doing the calculation in separate sub-lists of 30 and then combining the results.
So, what if the LLP is the 30th element of sub-list N and the HHP is the 1st element of sub-list N+1? If you have taken that into account, then it's fine.
If you haven't, I would suggest a moving-window approach to reading the data. You would start from the 0th element of the 6000+ TOHLC values with a window size of 30 and slide it along one element at a time. This way, you won't miss any values.
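A moving-window variant might look like the sketch below (Python). It is only one interpretation of the suggestion: every window of 30 consecutive candles contributes its lowest low as a candidate, and candidates are de-duplicated by index because neighbouring windows usually share the same low. The name `sliding_support` is hypothetical.

```python
def sliding_support(candles, window=30, min_rise=1.03):
    """candles: list of [timestamp, open, high, low, close] entries."""
    accepted = {}  # index of accepted low -> candle, to avoid duplicates
    for start in range(len(candles) - window + 1):
        # Lowest low inside the current window, as an index into the full list.
        low_idx = min(range(start, start + window), key=lambda i: candles[i][3])
        following = candles[low_idx + 1:low_idx + 1 + window]
        if not following:
            continue
        hhp = max(c[2] for c in following)
        if hhp / candles[low_idx][3] >= min_rise:
            accepted[low_idx] = candles[low_idx]
    return [accepted[i] for i in sorted(accepted)]
```

This does O(n * window) work, which is small for 6000 candles. It will generally accept more lows than fixed sub-lists do, so the threshold may need re-tuning.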
Some of the selected blue dots have a deeper dip than others. Why is that? I would separate them into another classifier. If you store them in an object, store the dip rate as well.
Floating-point numbers are not recommended in finance. If possible, I'd use a different approach, and perhaps a different classifier, that uses only integers. It may not affect you or your project right now, but rounding errors will start to produce false results as the numbers add up.
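As a small illustration of the integer-only idea (Python): if prices are stored as integer cents, which is an assumption about the data and not something stated in the question, the 3% rise test can be done without any division or floating-point value. The helper name `rose_enough` is hypothetical.

```python
def rose_enough(low_cents, high_cents, pct=3):
    """True if high is at least pct percent above low; integer arithmetic only."""
    # high / low >= (100 + pct) / 100, rearranged to avoid division.
    return high_cents * 100 >= low_cents * (100 + pct)
```

For example, a low of 83.70 stored as 8370 passes against a high of 8622 but not against 8621.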
I have a set of messages, each of which has an arrival timestamp. I would like to analyze the set and determine the periodicities of the messages' arrival. (With that, I can then detect, with some degree of certainty, when subsequent messages are late or missing.) So a discrete Fourier transform seems the logical choice to pull the frequency or frequencies from the set.
However, all the explanations of discrete Fourier transforms that I've seen start from a finite set of values sampled at a constant frequency, whereas what I have is simply a set of monotonically increasing timestamp values.
Convert to time series data?
I've thought of selecting a small resolution -- e.g. one second -- and then producing a time series beginning at the time of the first message, through the current real time, and a corresponding value of (0,1) at each of those time points. (Mostly zeros, with ones at the arrival time of each message.)
More specifics
I have many sets: I need to perform this calculation many times, as I have many different sets of messages to analyze. Each set might be on the order of 1,000 messages spanning up to a year of real time. So if I converted one set of messages into a time series, as described above, that's ~32 million time series data points (the number of seconds in a year), with only ~1,000 non-zero values.
Some of the message sets are more frequent: ~5,000 messages over the scale of days, so that would be more like ~400,000 time series data points, but still with only ~5,000 non-zero values.
Is this sane (convert arrival times to a time series and then head for plain FFT work)? Or is there a different way to apply Fourier transforms to my actual data (message arrival times)?
I suggest that you bin the message counts into evenly-spaced bins of a suitable duration, and then treat these bins as a time series and generate a spectrum from the series, using e.g. an FFT-based method. The resulting spectrum should show any periodicities as peaks around particular bin frequencies.
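A minimal sketch of the binning idea in pure Python. The function names and the rule for picking the peak are assumptions for illustration. The FFT here is a textbook recursive radix-2 version so the example has no dependencies; in practice a library routine such as `numpy.fft.rfft` would likely be used instead.

```python
import cmath

def bin_counts(timestamps, bin_seconds, n_bins):
    """Count arrivals per evenly spaced bin, starting at the first message."""
    start = min(timestamps)
    counts = [0] * n_bins
    for t in timestamps:
        idx = int((t - start) // bin_seconds)
        if idx < n_bins:  # arrivals past the last bin are ignored
            counts[idx] += 1
    return counts

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dominant_period(timestamps, bin_seconds, n_bins):
    """Period (in seconds) of the lowest-frequency bin close to the peak."""
    counts = bin_counts(timestamps, bin_seconds, n_bins)
    mean = sum(counts) / n_bins
    spectrum = fft([c - mean for c in counts])  # subtract mean to drop the DC term
    mags = [abs(spectrum[k]) for k in range(1, n_bins // 2 + 1)]
    peak = max(mags)
    # A spike train has harmonics as tall as the fundamental, so take the
    # lowest frequency whose magnitude is within 1% of the peak.
    k_best = next(k for k, m in enumerate(mags, start=1) if m >= 0.99 * peak)
    return n_bins * bin_seconds / k_best
```

The bin width sets the trade-off: wider bins mean far fewer points than one-per-second sampling, at the cost of not resolving periods shorter than two bins.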
Goal
How can the data that describes how to re-order a static list from one order to another be encoded using the minimum number of bytes possible?
Original Motivation
Originally this problem arose while working on a problem relaying sensor data using expensive satellite communication. A device had a list of about 1,000 sensors they were monitoring. The sensor list couldn't change. Each sensor had a unique id. All data was being logged internally for eventual analysis, the only thing that end users needed daily was which sensor fired in which order.
The entire project was scrapped, but this problem seems too interesting to ignore. Also previously I talked about "swaps" because I was thinking of the sorting algorithm, but really it's the overall order that's important, the steps required to arrive at that order probably wouldn't matter.
How the data was ordered
In SQL terms you could think of it like this.
**Initial Load**
create table sensor ( id int, last_detected datetime, other stuff )
-- fill table with ids of all sensors for this location
Day 0: Select ID from Sensor order by id
(initially the data is sorted by sensor.id because it's a known value)
Day 1: Select ID from Sensor order by last_detected
Day 2: Select ID from Sensor order by last_detected
Day 3: Select ID from Sensor order by last_detected
Assumptions
The starting list and ending list are composed of the exact same set of items
Each sensor has a unique id (32 bit integer)
The size of the list will be approximately 1,000 items
Each sensor may fire multiple times per minute or not at all for days
Only the change in ID sort order needs to be relayed.
Computation resources for figuring out optimal solutions are cheap / unlimited
Data costs are expensive, roughly a dollar per kilobyte.
Data could only be sent as whole byte (octet) increments
The Day 0 order is known by the sender and receiver to start with
For now assume the system functions perfectly and no error checking is required
As I said the project/hardware is no more so this is now just an academic problem.
The Challenge!
Define an Encoder
Given A. Day N sort order
Given B. Day N + 1 sort order
Return C. a collection of bytes that describe how to convert A to B using the least number of bytes possible
Define a Decoder
Given A. Day N sort order
Given B. a collection of bytes
Return C. Day N + 1 sort order
Have fun everyone.
As an academic problem, one approach would be to look at Algorithm P in section 3.3.2 of Volume 2 of Knuth's The Art of Computer Programming, which maps every permutation of N objects to an integer between 0 and N!-1. If every possible permutation is equally likely at any time, then the best you can do is to compute and transmit this (multi-precision) integer. In practice, giving each sensor a 10-bit number and then packing those 10-bit numbers so that you have, e.g., 4 numbers packed into each chunk of 5 bytes would do almost as well.
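A sketch of the permutation-to-integer idea in Python, matching the encoder/decoder interface from the question. It uses the standard factorial-number-system (Lehmer code) ranking, which may differ in detail from Knuth's Algorithm P; the function names are hypothetical.

```python
from math import factorial

def payload_size(n):
    """Bytes needed to hold any rank in 0 .. n!-1."""
    return ((factorial(n) - 1).bit_length() + 7) // 8

def encode(a, b):
    """Return bytes describing how order `a` (Day N) becomes order `b` (Day N+1)."""
    pos = {item: i for i, item in enumerate(a)}
    perm = [pos[item] for item in b]   # b expressed as indices into a
    remaining = list(range(len(a)))    # indices not yet placed, ascending
    rank = 0
    for i, p in enumerate(perm):
        d = remaining.index(p)         # Lehmer digit, in 0 .. n-1-i
        remaining.pop(d)
        rank = rank * (len(perm) - i) + d
    return rank.to_bytes(payload_size(len(a)), "big")

def decode(a, payload):
    """Rebuild the Day N+1 order from order `a` and the encoded bytes."""
    n = len(a)
    rank = int.from_bytes(payload, "big")
    digits = []
    for radix in range(1, n + 1):      # peel digits off, least significant first
        rank, d = divmod(rank, radix)
        digits.append(d)
    digits.reverse()
    remaining = list(range(n))
    return [a[remaining.pop(d)] for d in digits]
```

For 1,000 items this comes to 1,067 bytes, against 1,250 bytes for plain 10-bit packing, which is the "almost as well" gap mentioned above. The `remaining.index`/`pop` calls make it O(n^2), which is negligible at this size.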
Schemes based on diff or off-the-shelf compression make use of the knowledge that not all permutations are equally likely. You may know this from the equipment, or you could check whether it is the case by looking at previous data. Fine if it works. In some cases with sensors and satellites you might want to worry about rare exceptions where you hit the worst-case behaviour of your compression scheme and suddenly have more data to transmit than you bargained for.