I have data that looks like the following image, but reversed: it starts small and increases over time. All of the data is currently in a CSV file, with time in one column and the values in another. What transformations could I apply to the data to discover trends or a relationship? Thank you!
I am doing a science project on classifying medical images, but I do not have much data. Is it okay to augment the data first, then randomly select which data to keep, and split the kept data afterward? At first, my teacher told me to augment the data and then split it into train, validation, and test sets. But I think my proposed method will make the training dataset overlap with the testing dataset, which will make the accuracy unrealistic (way too high), so I thought that randomly choosing files after data augmentation should keep the augmented samples from being too similar to each other and solve the imbalanced dataset problem.
We want our model to generalize well to data it has not seen, so the validation and test sets should contain only original, unaugmented samples. Technically, that means doing data augmentation only on the training set. I would suggest that you split your dataset into training, validation and testing first, then do data augmentation only on the training set.
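A minimal sketch of the split-first approach, assuming each sample can be traced back to its source image. The names `split_then_augment` and `toy_augment`, the split fractions, and the tuple representation of a sample are all made up for illustration; a real project would apply actual image transforms instead.

```python
import random


def split_then_augment(samples, augment, copies=3, val_frac=0.15, test_frac=0.15, seed=0):
    """Split the originals first, then augment only the training portion."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]

    # Augmented copies are derived from training originals only, so no
    # variant of a validation or test image can end up in training.
    augmented = [augment(s, rng) for s in train for _ in range(copies)]
    return train + augmented, val, test


def toy_augment(sample, rng):
    # Hypothetical stand-in for a real image transform (flip, rotate, crop).
    # A sample is represented here as (source_id, variant_tag).
    source_id, _ = sample
    return (source_id, rng.choice(["flip", "rotate", "crop"]))


originals = [(i, "original") for i in range(100)]
train, val, test = split_then_augment(originals, toy_augment)

# No source image appears on both sides of the split.
train_ids = {sid for sid, _ in train}
held_out_ids = {sid for sid, _ in val + test}
print(train_ids.isdisjoint(held_out_ids))
```

If the data were augmented before splitting, variants of the same source image could land in both training and test sets, which is the leakage the question worries about.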
What is the best practice for including Created By, Created Timestamp, Modified By, Modified Timestamp into a dimensional model?
The first two never change. The last two will change slowly for some data elements but rapidly for other data elements. However, I'd prefer a consistent approach so that reporting users become familiar with it.
Assume that I really only care about the most recent value; I don't need history.
Is it best to put them into a dimension knowing that, for highly-modified data, that dimension is going to change often? Or, is it better to put them into the fact table, treating the unchanging Created information much the same way a sales order number becomes a degenerate dimension?
In my answer I will assume that these additional columns do not define the validity of the dimensional record and that you are talking about a Slowly Changing Dimension type 1.
So we are in fact talking about dimensional metadata here, about who / which process created or modified the dimensional row.
I would always put this kind of metadata in the dimension, for two reasons:
It is related to changes in the dimension, and these changes happen independently of the fact table.
In general it is advised to keep fact tables as small as possible. If your fact table references 5 dimensions, this would add 5*4=20 extra columns to it, which would seriously bloat it and hurt performance.
I have a large dimension and it is taking me more and more time to process it. I would like to decrease the processing time as much as possible.
There are literally hundreds of articles on how to process SSAS objects as efficiently and quickly as possible.
There are lots of tips and tricks that one can apply to speed up dimension and cube processing. I have applied all, or at least a large majority, of them and I am still not happy with the result.
I have a large dimension built on top of a table.
It has around 60 million records and it keeps growing fast.
Rows are either added to it or deleted from it; updates are not possible.
I am looking for a solution that will allow me to perform an incremental processing of my dimension.
I know that the data in the previous month will not change. I would like to do something similar to partitioning my cube, but on the dimension.
I am using SQL Server 2012 and, to my knowledge, dimension partitioning is not supported.
I am currently using Process Update on my dimension. I tried processing by attribute and by table, but both give almost the same result. I have hierarchies and relationships, some set to rigid. I am only using attributes that are truly needed, and so on.
Process Update has to read all the records in a dimension, even those that I know have not changed. Is there a way to partition a dimension? If I could tell SSAS to process only the last 3-4 weeks of data in my dimension and not touch the rest, it would greatly speed up my processing.
I would appreciate your help
OK, so I did a bit of research and I can confirm that incremental dimension processing is not supported.
It is possible to do Process Add on a dimension, but not if you have records that got deleted or updated.
It would be a useful thing to have, but MS hasn't developed it and I don't think it will.
Incremental processing of any table is, however, possible in tabular cubes.
So if you have a similar requirement and your cube is not too complex, then creating a tabular cube is the way to go.
I need some suggestions on using d3.js to visualize big data. I am pulling data from HBase and storing it in a JSON file to visualize with d3.js. When I pull a few hours of data, the JSON file is around 100MB and d3.js can visualize it easily, but filtering with dc.js and crossfilter is a little slow. When I pull one week of data, the JSON file grows to more than 1GB; the visualization with d3.js, dc.js and crossfilter then does not work properly and filtering is not possible. Is there a good solution to this, or do I need to work on a different platform instead of d3?
I definitely agree with what both Mark and Gordon have said before. But I must add what I have learnt in the past months as I scaled up a dc.js dashboard to deal with pretty big datasets.
One bottleneck is, as pointed out, the size of your datasets when it translates into thousands of SVG/DOM or Canvas elements. Canvas is lighter on the browser, but you still have a huge number of elements in memory, each with its attributes, click events, etc.
The second bottleneck is the complexity of your data. The responsiveness of dc.js depends not only on d3.js, but also on crossfilter.js. If you inspect the Crossfilter example dashboard, you will see that the size of the data they use is quite impressive: over 230000 entries. However, the complexity of those data is rather low: just five variables per entry. Keeping your datasets simple helps a lot when scaling up. Keep in mind that five variables per entry here means about one million values (230000 x 5) in the browser's memory during visualization.
Final point, you mention that you pull the data in JSON format. While that is very handy in Javascript, parsing and validating big JSON files is quite demanding. Besides, it is not the most compact format. The Crossfilter example data are formatted as a really simple and tight CSV file.
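To make the compactness point concrete, here is a rough sketch that serializes the same rows as a JSON array of objects and as CSV, then compares the text sizes. The field names and values are invented for illustration, and the exact ratio will depend on your own data.

```python
import csv
import io
import json

# Hypothetical rows with five variables each; the field names are made up.
fields = ["timestamp", "sensor", "value", "status", "region"]
rows = [
    {"timestamp": 1700000000 + i, "sensor": i % 50, "value": i * 0.5,
     "status": i % 3, "region": i % 7}
    for i in range(10_000)
]

# JSON as an array of objects repeats every field name in every row.
json_text = json.dumps(rows)

# CSV states the field names once, in the header line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(len(json_text), len(csv_text))
```

The difference comes mostly from the repeated keys, so a JSON layout that uses arrays of values instead of objects would narrow the gap.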
In summary, you will have to find the sweet spot between size and complexity of your data. One million data values (size times complexity) is perfectly feasible. Increase that by one order of magnitude and your application might still be usable.
As @Mark says, canvas versus DOM rendering is one thing to consider. For sure the biggest expense in Web visualization is DOM elements.
However, to some extent crossfilter can mitigate this by aggregating the data into a smaller number of visual elements. It can get you up into the hundreds of thousands of rows of data. 1GB might be pushing it, but 100s of megabytes is possible.
But you do need to be aware of what level you are aggregating at. So, for example, if it's a week of time series data, probably bucketing by the hour is a reasonable visualization, for 7*24 = 168 points. You won't actually be able to perceive many more points, so it is pointless asking the browser to draw thousands of elements.
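As a rough illustration of hourly bucketing, the sketch below averages a week of per-minute readings into 7*24 = 168 points. Doing this in Python before the data reaches the browser is an assumption made for the example; crossfilter's own grouping would do the equivalent in JavaScript. The function name `bucket_by_hour` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_by_hour(records):
    """Average (timestamp, value) records into one point per hour."""
    sums = defaultdict(lambda: [0.0, 0])  # hour -> [running total, count]
    for ts, value in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        sums[hour][0] += value
        sums[hour][1] += 1
    return [(hour, total / count) for hour, (total, count) in sorted(sums.items())]


# Hypothetical week of data with one reading per minute.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [(start + timedelta(minutes=m), float(m % 60)) for m in range(7 * 24 * 60)]

points = bucket_by_hour(records)
print(len(records), "->", len(points))
```

Whether the mean, sum, or maximum is the right aggregate per bucket depends on what the chart is meant to show.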
A GAMBIT file holds the inputs for an n-player strategic form game. It consists of payoff values for each player's strategy. In order to manipulate the data, it needs to be stored efficiently in some data structure.
Since the number of players and the number of actions are user inputs, this is posing a bit of a problem. I am not able to come up with a better data structure than a linear array to store the data. Any help is appreciated.
The code needs to be in C/C++/Java. So any suggestions on doing it in MATLAB or other software won't be helpful.
For those who need to know about GAMUT and GAMBIT, refer