I am attempting to parse this HTML table representing a year's worth of temperature data, provided by an Australian government website.
This table is set up in an unusual way: the columns are months, and the rows are days of the month (so the first row's cells are JAN 1, FEB 1, MAR 1). Each cell contains a number if there's data recorded for that day, is empty if no data was recorded, or has the class notDay if the day does not exist (e.g. Feb 31st).
My intent is to build a database full of this data in the format:
DATE        RAINFALL  MAX TEMP
2015-02-07  35        31
2015-02-07  40        17
My question is: what would be the simplest or most efficient (in terms of programmer efficiency) way to parse the table to get the data into a usable format?
I'm personally using Ruby with the Nokogiri library, but general non-language-specific algorithm/approach advice is welcome if it makes for a better discussion. I'm not looking for someone to write the code and solve the problem for me, but for advice about the approach to take.
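I wonder if you can:
1) Take all the cells month by month, so they run in calendar order. The HTML gives you the cells row by row (Jan 1, Feb 1, ...), so use Array#transpose on the array of rows first, then Array#flatten to get a single list.
2) Discard any notDay cells with Array#reject.
3) Iterate over all the relevant dates using a date range, pairing each date with the next cell:
(Date.new(2014,1,1) .. Date.new(2014,12,31)).each {...}
And go from there...?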
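A minimal sketch of the steps above, using only the standard library. It assumes the cells have already been pulled out of the HTML (for instance with Nokogiri) into 31 rows of 12 cells; the Cell struct and the cells_by_date method are hypothetical names made up for this illustration. Because the table reads row by row (Jan 1, Feb 1, ...), the sketch transposes the grid before flattening so the cells line up with the date range.

```ruby
require 'date'

# Hypothetical stand-in for a parsed <td>: its text, and whether it had the notDay class.
Cell = Struct.new(:text, :not_day)

# rows: 31 arrays (days of the month), each holding 12 cells (months).
# Returns a hash of Date => Float, or nil where the cell was empty.
def cells_by_date(rows, year)
  # Transpose so cells run Jan 1..Jan 31, Feb 1.. rather than Jan 1, Feb 1, ...
  real_days = rows.transpose.flatten.reject(&:not_day)
  dates = (Date.new(year, 1, 1)..Date.new(year, 12, 31)).to_a
  unless real_days.size == dates.size
    raise ArgumentError, "expected #{dates.size} day cells, got #{real_days.size}"
  end

  dates.zip(real_days).each_with_object({}) do |(date, cell), result|
    text = cell.text.to_s.strip
    result[date] = text.empty? ? nil : Float(text)
  end
end

# Fake 31 x 12 grid for 2014: each real day holds "month.day", one day has no data.
rows = (1..31).map do |day|
  (1..12).map do |month|
    if !Date.valid_date?(2014, month, day)
      Cell.new("", true)          # e.g. Feb 31st
    elsif month == 3 && day == 1
      Cell.new("", false)         # a real day with nothing recorded
    else
      Cell.new("#{month}.#{day}", false)
    end
  end
end

temps = cells_by_date(rows, 2014)
```

The size check is there because the whole approach depends on the number of surviving cells matching the number of days in the year; if the page ever marks non-existent days differently, the sketch fails loudly rather than shifting every date.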
Project   Cost
January   323
Feb       323
I have the table shown above, in which each row is a month (filtered by a certain project) and the values are the cost of the project. I want to calculate the difference between two months, but I am having trouble.
How do I subtract two rows from each other?
This is the code I wrote:
Variance = [Cost] - CALCULATE([Cost], PREVIOUSMONTH('Month'[Month]))
I get the following error: "A column specified in the call to function is not of type date."
Is there a way to manually subtract two months?
The best way to do it is to replace your month with an actual Date value, for example the first of the month. If your month dates are not unique, you should create a Dates table (see Microsoft's guidance on date tables) and join to it. Assuming your month dates are unique, you should then be able to do something like this:
Variance = [Cost] - CALCULATE([Cost], PARALLELPERIOD('Month'[Month], -1, MONTH))
You can use the EARLIER function here, but only when the months have an ID, such as:
1 Jan
2 Feb
3 March
...
However, I would suggest creating a date table and then adding a relationship from the date table to your table. With a date table you can easily achieve this using the built-in date functions.
I have a table that contains only two columns: Legend (for dates) and EOD Volume (for volume).
I need to calculate the difference from the previous date's volume. For example, to calculate the difference between Feb 29 and March 2nd, it will be ((1469-1877) / 1469) * 100%. How do I create this measure in Power BI? The data contains both weekends and weekdays, and I need the analysis for all dates regardless of whether they are weekends or weekdays. Could someone please help me with this? Thank you in advance.
My proposed solution works on a table at day granularity. Additionally, to handle working days, the best practice is to manage them as a binary attribute in the back end, because working days differ from country to country, so there is no standard dynamic way to handle them.
Possible Solution:=
VAR _YESTERDAY = CALCULATE(MAX('Fact'[EOD Volume]), PREVIOUSDAY('Calendar'[CalendarKey]))
VAR _TODAY = CALCULATE(MAX('Fact'[EOD Volume]))
RETURN
DIVIDE(_TODAY - _YESTERDAY, ABS(_YESTERDAY))
I have a huge CSV file (over 57,000 rows and 50 columns) that I need to analyze.
Edit: Hi guys, thanks for your answers and comments, but I am still really confused about how to do this in Ruby, and I have no idea how to use MySQL. I will try to be more specific:
The CSV files:
CSV on Storm Data Details for 2015
CSV on Storm Data Details for 2000
The questions:
Prior to question start, for all answers, exclude all rows that have a County/Parish, Zone, or Marine name that begins with the letters A, B, or C.
Find the month in 2015 where the State of Washington had the largest number of storm events. How many days of storm-free weather occurred in that month?
How many storms impacting trees happened between 8PM EST and 8AM EST in 2000?
In which year (2000 or 2015) did storms have a higher monetary impact within the boundaries of the 13 original colonies?
The problems:
1) I was able to use filters in Excel to determine that the most "Thunderstorm Wind" events in Washington happened in July (6 entries), and there were 27 days of storm-free weather. However, when I tried to check my work in Spotfire, I got completely different results. (7 entries in May, and 28 days of storm-free weather in May. Excel only found two Thunderstorm Wind events in May.) Do you know what could be causing this discrepancy?
2) There are two columns where damage to trees might be mentioned: Event_Narrative and Episode_Narrative. Would it be possible to search both columns for "tree" and filter the spreadsheet down to only those results? Multiple-column filtering is apparently impossible in Excel. I would also need to find a way to omit the word "street" in the results (because it contains the word "tree").
The method I came up with for the time range is to filter to only EST and AST results, then filter Begin_Time to 2000 to 2359 and 0 to 759 and repeat those ranges to filter End_Time. This appears to work.
3) I was able to filter the states to Delaware, Pennsylvania, New Jersey, Georgia, Connecticut, Massachusetts, Maryland, South Carolina, New Hampshire, Virginia, New York, North Carolina, and Rhode Island. It seems like a simple task to add all the values in Columns Y and Z (Damage_Property, Damage_Crops) and compare between the two years, but the values are written in the form "32.79K" and I cannot figure out how to make the adding equation work in that format or convert the values into integers.
Also, the question is asking for the original territory of the colonies, which is not the same as the territory those states now occupy. Do you know of a way to resolve this issue? Even if I had the time to look up each city listed, there does not seem to be a database of cities in the original 13 colonies online, and even if there were, the names of the cities may now be different.
I am learning Ruby and some people have suggested that I try to use the Ruby CSV library to put the data into an array. I have looked at some tutorials that sort of describe how to do that, but I still don't understand how I would filter the data down to only what I need.
Can anyone help?
Thank you!
I downloaded the data so I could play with it. You can get the record count pretty easily in Ruby. I just did it in irb:
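require 'csv'

# Read every row into memory (the first row is the header row).
details = []
CSV.foreach("StormEvents_details-ftp_v1.0_d2015_c20160818.csv") do |row|
  details << row
end

# field[-2] and field[-3] should be the two narrative columns and field[8] the state column.
results = details.select do |field|
  [field[-2], field[-3]].any? { |el| el[/\btree\b/i] } && field[8] == "CALIFORNIA"
end

results.count
=> 125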
I just used array indices. You could zip things together and make hashes for better readability.
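As a rough sketch of the hash-based variant, the CSV library can do the zipping itself when called with headers: true. The example below runs on a tiny inline sample rather than the real file, and the header names (STATE, EVENT_NARRATIVE, EPISODE_NARRATIVE, DAMAGE_PROPERTY) are assumptions that should be checked against the first line of the download. The damage_to_dollars helper is a hypothetical name; the question only shows values like "32.79K", so the "M" and "B" suffixes it accepts are also assumptions.

```ruby
require 'csv'

# Matches "tree" or "trees" as a whole word, so "street" is not a hit.
TREE = /\btrees?\b/i

# Assumed suffixes: only "K" appears in the question; "M" and "B" are guesses.
MULTIPLIERS = { 'K' => 1_000, 'M' => 1_000_000, 'B' => 1_000_000_000 }.freeze

# Convert a string such as "32.79K" into a number of dollars (blank counts as 0).
def damage_to_dollars(value)
  text = value.to_s.strip.upcase
  return 0.0 if text.empty?

  match = text.match(/\A(\d+(?:\.\d+)?)([KMB])?\z/)
  raise ArgumentError, "unrecognised damage value: #{value.inspect}" unless match

  # Rational avoids floating-point drift when scaling the decimal part.
  (Rational(match[1]) * MULTIPLIERS.fetch(match[2], 1)).to_f
end

# Made-up sample data standing in for the real file.
SAMPLE = <<~DATA
  STATE,EVENT_NARRATIVE,EPISODE_NARRATIVE,DAMAGE_PROPERTY
  CALIFORNIA,A tree fell on a car.,Strong winds.,32.79K
  CALIFORNIA,Flooding on Main Street.,Heavy rain.,1.5M
  OREGON,,Several trees were uprooted.,
DATA

# Each row becomes a hash keyed by header name; empty fields come back as nil.
rows = CSV.parse(SAMPLE, headers: true).map(&:to_h)

# Keep rows where either narrative column mentions a tree.
tree_rows = rows.select do |row|
  [row['EVENT_NARRATIVE'], row['EPISODE_NARRATIVE']].any? { |text| text.to_s.match?(TREE) }
end

total_damage = tree_rows.sum { |row| damage_to_dollars(row['DAMAGE_PROPERTY']) }
```

For the real file, CSV.foreach(path, headers: true) yields the same kind of rows one at a time, which avoids holding all 57,000 of them in memory at once.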
Wanted to post this as a comment but I don't have enough rep. Anyways:
I have converted CSV/XLS files to JSON in the past with the help of some Node.js packages and uploaded them to my Couchbase database. Within Couchbase I can query with N1QL (really just SQL), which lets you filter on multiple criteria, as you are trying to do. Like spickermann said, a database will solve your problem.
Edit:
MySQL also supports importing a CSV file into a MySQL table. That will be easier than going from CSV to JSON to Couchbase.
Csv-to-json
https://github.com/cparker15/csv-to-json/blob/master/README.md
I am working on my first application for Mac which uses Core Data. Since I don't have much software development experience, I would like to ask the more experienced developers the following question:
The user will have to enter a date in a couple of the forms. Since the app will be on the App Store and people from different continents will download it (I hope so), I am thinking of allowing the user to select their preferred date format from the preferences panel that I have in my app.
But I am wondering what will happen if, after entering 500 or more records, they decide to change the date format again. Will that eventually cause a mess in Core Data?
Is this a good idea, or should I keep things simple and just use the system date format (the one set on the user's computer)? What would you do? Any advice will be deeply appreciated.
My advice is to store the date as a time interval. NSDate has a property for this, timeIntervalSince1970, documented as:
The interval between the date object and 00:00:00 UTC on 1 January 1970.
So if you get an NSDate object from an NSDateFormatter object, you can obtain the time in seconds since 1970. You could store this value in Core Data and use it later to create NSDate objects. It works across locales and time zones, and you can display it in whatever format is appropriate.
Dates are a complex topic, and I suggest you read the guides on dates and date formatters.
The first thing is to decide how you should store the date. The answer here is: as an NSDate. An NSDate is a single precise point in time, so in a sense it stores both date and time.
This means that, for example, 1 PM in Berlin and 8 PM in Kuala Lumpur will be the exact same NSDate value (during winter months), but 2 PM in London and 2 PM in Paris on the same calendar date will not be the same NSDate value. This is quite a complex topic; read the Date and Time Programming Topics documentation from Apple.
Then, as you say, you need to allow your user to input the date. The way to do that is to use an NSDateFormatter tied to your input control. The formatter can be set up to follow the system settings, which means you get the localisation you are seeking for free, so that part is in fact easy.
The tricky part is deciding what you really want to store when you only need the calendar date, without an associated time. Suppose, for example, you decide to store the date combined with 12:00 noon in the local time zone. If the user then moves to a time zone more than 12 hours away, the date may be displayed as the previous or the next day. The safest bet is to store the date combined with 12:00 noon GMT, as this is in the middle of the time zone range. A few locations are 13 or 14 hours off and could exhibit the problem anyway, but these are small atolls in the Pacific and could possibly be safely ignored.
However, the best thing is if you can determine that what you want to store is really a precise point in time rather than a date (which is a fuzzy 24-hour span). For example, in a calendar app an event usually takes place at a specific time on a specific date, so store that time and date.
OK, I've seen similar questions on here, but nothing exactly the same. I am creating reports based on a cube that reads data from a DW. A lot of the reports tend to be along the lines of Value by Something by Week or Value by Something by Month. Everything seems OK, but the week and month columns don't order correctly. Week 10 goes before Week 9, February comes before January, etc. I'm very frustrated because I can't get these things to work correctly.
To add to this, at some point my customer needs to be able to write their own reports against the cube using Report Builder 3.0, so I am reluctant to rely on manually editing the query. SURELY there is some obvious way to do this. In my DimDate I have a weekname that is a varchar, a week that is a date, etc. Same for month.
I'm missing something obvious here.
Thanks!
The sort order you are seeing makes sense: the varchars are sorted as strings, so "Week 10" comes before "Week 9" and "February" comes before "January", assuming an ASCII-style sort on the string values.
There are multiple ways to get an ascending sort with strings as column headers (again assuming ASCII-style sorting on the string field):
Ensure week numbers are two digits long, e.g. "Week 9" becomes "Week 09". This ensures that the week columns sort in ascending (or descending) order.
Add a month number in front of the month name, e.g. "01 January", "02 February". You still need two-digit month numbers, otherwise you get the same issue you had with week numbers.
Use formatted dates rather than strings, as dates sort properly.
Alternatively, if the issue is caused by the dimension within the cube, make sure any ORDER BY clauses are on key fields, not name fields.
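The string-ordering effect and the fixes listed above can be reproduced in a few lines of Ruby. This is purely an illustration of the principle, with made-up labels; in the report itself the same idea would be applied in the date dimension or the query, not in Ruby.

```ruby
require 'date'

weeks = ["Week 9", "Week 10", "Week 2"]

# A plain string sort compares character by character, so "1" < "2" < "9".
string_sorted = weeks.sort
# => ["Week 10", "Week 2", "Week 9"]

# Fix 1: zero-pad the number so string order matches numeric order.
padded_sorted = weeks.map { |w| w.sub(/\d+/) { |n| n.rjust(2, "0") } }.sort
# => ["Week 02", "Week 09", "Week 10"]

# Fix 2: keep the label as it is, but sort on a numeric key instead of the name.
key_sorted = weeks.sort_by { |w| w[/\d+/].to_i }
# => ["Week 2", "Week 9", "Week 10"]

# The same key idea for month names, using the month number as the key.
months = %w[February January March]
month_sorted = months.sort_by { |m| Date::MONTHNAMES.index(m) }
# => ["January", "February", "March"]
```

The second fix mirrors the advice about ordering on keys rather than name fields: the displayed label stays readable while a separate numeric value decides the order.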