Import data from URL - wolfram-mathematica

The St. Louis Federal Reserve Bank has a great set of data available on a variety of their web pages, such as:
http://research.stlouisfed.org/fred2/series/OILPRICE/downloaddata?cid=32217
http://www.federalreserve.gov/releases/h10/summary/default.htm
http://research.stlouisfed.org/fred2/series/DGS20
The data sets get updated, some as often as daily. I tend to have an interest in the daily data (see the above settings on the URLs).
I'd like to import these kinds of price or rate data streams (accessible as CSV or Excel files at the above URLs) directly into Mathematica.
I've looked at the documentation on Import[], but I find scant documentation (actually none) on how to go about something like this.
It looks like I need to navigate to the pages, send some data to select specific files and formats, trigger the download, then access the downloaded data from my own machine. Even better if I could access the data directly from the sites.
I had hoped Wolfram Alpha might make this sort of thing easy, but I haven't had any success.
FinancialData[] would seem natural for this sort of thing, but I don't see any way to do it. FinancialData has lots of features, but I don't see a way to get this sort of thing.
Does anyone have any experience with this or can someone point me in the right direction?

You can Import directly from a URL. For example, the data from federalreserve.gov can be obtained and visualized as follows.
url = "http://www.federalreserve.gov/datadownload/Output.aspx?";
url = url<>"rel=H10&series=a660e724c705cea4b7bd1d1b85789862&lastObs=&";
url = url<>"from=&to=&filetype=csv&label=include&layout=seriescolumn";
data = Import[url, "CSV"];
DateListPlot[data[[7 ;;]], Joined -> True]
I broke up url for convenience, since it's so long. I had to examine the contents of data before I knew exactly how to plot it - a step that is typically necessary. I'm sure that the data from stlouisfed.org can be obtained in a similar way, but it requires the use of an API with a key to access it.

As Mark said, you can get the data directly from a URL. Your oil data can be imported from a different URL than you had:
http://research.stlouisfed.org/fred2/data/OILPRICE.txt
With that URL, you can do this:
oil = Import["http://research.stlouisfed.org/fred2/data/OILPRICE.txt",
"Table", "HeaderLines" -> 12, "DateStringFormat" -> {"Year", "Month", "Day"}];
DateListPlot[oil, Joined -> True, PlotRange -> All]
Note that "HeaderLines"->12 option strips off the header text in the first 12 lines (you have to count the header lines to know how many to remove). I've also specified the date format.
To find that URL, do as you did before, but click on a data series and then choose View Data from the menu on the left when you see the chart.

The documentation has a short example on extracting data out of a webpage:
http://reference.wolfram.com/mathematica/howto/CleanUpDataImportedFromAWebsite.html
Of course, what actually needs to be done will vary significantly from page to page.

There is a discussion of how to do this with your API key here:
http://library.wolfram.com/infocenter/MathSource/7583/
The function is based on the API documentation. I haven't looked at the code for a couple of years, and from memory I put it together rather quickly, but I have used it regularly for over two years without problems, for example to pull monthly, non seasonally adjusted retail sales from early 1992 to now.
Wolfram Alpha also uses FRED data, so you could use that as an alternative to direct import, but it is more tricky to get the query right. I prefer to use FRED directly. Also, from memory, the data is only available on Alpha the day after the release, which is not what you would typically want.
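For anyone curious about the raw web service that such a package wraps, here is a rough sketch in Python (not the Wolfram Language package itself). The endpoint and parameter names follow the public FRED API documentation; the series ID DGS20 is just one of the series mentioned above, and you would substitute your own API key.
import requests  # assumes the requests library is installed

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"
params = {
    "series_id": "DGS20",            # 20-year Treasury constant maturity, one of the series above
    "api_key": "YOUR_FRED_API_KEY",  # placeholder; register with FRED to obtain a key
    "file_type": "json",
}

resp = requests.get(FRED_URL, params=params)
resp.raise_for_status()
observations = resp.json()["observations"]

# Each observation has a "date" and a "value" ("." marks missing data)
series = [(o["date"], o["value"]) for o in observations if o["value"] != "."]
print(series[-5:])  # the most recent few observations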

Related

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of the HTML markup of the page.
I need to determine whether there is an article on the page, and if there is, parse the text of the title, body and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether there is an article, and parsing only the needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome mobile does. When a page is not optimised for mobile, it proposes to show the page in adaptive mode (just the title and main content).
If you are using Python, some libraries to look at are:
scrapy, which scrapes data and can extract some of the results, and
BeautifulSoup, which is more geared towards the extraction part itself.
It is possible to request a particular version of a website (e.g. for Chrome, Safari, mobile, old-school systems) by creating a custom header for your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in scrapy here.
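As a rough illustration, a per-request header in scrapy looks something like the sketch below; the URL, the mobile User-Agent string and the CSS selectors are placeholders, not anything specific to your sites.
import scrapy

class MobileVersionSpider(scrapy.Spider):
    name = "mobile_version_example"

    def start_requests(self):
        # Placeholder URL; the mobile User-Agent asks the site for its mobile layout
        yield scrapy.Request(
            url="https://example.com/some-article",
            headers={
                "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) "
                              "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
            },
            callback=self.parse,
        )

    def parse(self, response):
        # Selectors will differ per site; this just grabs the title and paragraph text
        yield {
            "title": response.css("title::text").get(),
            "paragraphs": response.css("p::text").getall(),
        }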
I do not know of any more specialised tools. Your tasks are more analytical and are typically not performed with the use of models for estimating, e.g., what content is where on a webpage. This might be an interesting research direction though: to see whether you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is to say that creating a single scraper that works for any website (containing your article type) is not usually possible. People create websites differently, however they see fit, which means they also change them. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far to look back into the past, which it stores internally in its memory. In your case, you might be looking for the start and end HTML tags of an article, in order to extract just that part. These tags could be thousands of words apart, which is something a standard LSTM might not be fit to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. E.g., you could make it a "question-answering" problem, by saying: I have this HTML, where is the article content? If that sounds OK for your use case, have a look here for some model-based approaches.
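As a toy illustration of that question-answering framing, here is a minimal sketch using the Hugging Face transformers pipeline. Feeding it stripped page text (rather than raw HTML) and the toy context string are my own assumptions, and note that extractive QA models return short spans, so this framing would need real adaptation before it could pull out a whole article body.
from transformers import pipeline  # assumes transformers plus a backend such as PyTorch

qa = pipeline("question-answering")  # downloads a default extractive QA model on first use

page_text = (
    "Site navigation ... Example headline. The article body starts here and "
    "goes on for several paragraphs ... Related links and footer ..."
)

result = qa(question="What is the article headline?", context=page_text)
print(result["answer"], result["score"])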

Generate EDGAR FTP File Path List

I'm brand new to programming (though I'm willing to learn), so apologies in advance for my very basic question.
The SEC makes available all of their filings via FTP, and eventually I would like to download a subset of these files in bulk. However, before creating such a script, I need to generate a list of the locations of these files, which follow this format:
/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
51143 = the company ID, and I already accessed the list of company IDs I need via FTP
000005114313000007/0000051143-13-000007 = the report ID, aka "accession number"
I'm struggling with how to figure this out, as the documentation is fairly light. If I already have the accession number (000005114313000007/0000051143-13-000007), then it's pretty straightforward. But I'm looking at ~45k entries and would obviously need to generate these automatically for a given CIK ID (which I already have).
Is there an automated way to achieve this?
Welcome to SO.
I'm currently scraping the same site, so I'll explain what I've done so far. I am assuming that you have the CIK numbers of the companies you're looking to scrape. If you search for a company's CIK, you'll get a list of all of the files that are available for the company in question. Let's use Apple as an example (since they have a TON of files):
Link to Apple's Filings
From here you can set a search filter. The document you linked was a 10-Q, so let's use that. If you filter on 10-Q, you'll have a list of all of the 10-Q documents. You'll notice that the URL changes slightly to accommodate the filter.
You can use Python and its web scraping libraries to take that URL and scrape all of the URLs of the documents in the table on that page. For each of these links you can scrape whatever links or information you want off the page. I personally use BeautifulSoup4, but lxml is another choice for web scraping, should you choose Python as your programming language. I would recommend using Python, as it's fairly easy to learn the basics and some intermediate programming constructs.
Past that, the project is yours. Good luck; I've posted some links below to get you started. I'm only allowed to post two links since I'm new to the site, so I'll give you the Beautiful Soup link:
Beautiful Soup Home Page
If you choose to use Python and are new to the language, check out the Codecademy Python course, and don't forget to check out lxml, as some people prefer it over BeautifulSoup (some people also use both in conjunction, so it's all a matter of personal preference).
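To make that concrete, here is a rough requests + BeautifulSoup sketch that pulls the filing index links for one CIK from the EDGAR company browse page. The URL and its parameters mirror EDGAR's public browse interface, but the link filter is an assumption about the page layout, so verify it against the live page before relying on it.
import requests
from bs4 import BeautifulSoup

def filing_index_paths(cik, form_type="10-Q", count=40):
    # EDGAR's public company-browse endpoint; parameters mirror the search filter described above
    url = "https://www.sec.gov/cgi-bin/browse-edgar"
    params = {"action": "getcompany", "CIK": cik, "type": form_type,
              "owner": "include", "count": count}
    # The SEC asks automated clients to identify themselves in the User-Agent
    headers = {"User-Agent": "your-name your-email@example.com"}
    resp = requests.get(url, params=params, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Filing index pages end in "-index.htm"; this filter is an assumption about the markup
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith("-index.htm")]

print(filing_index_paths("0000320193"))  # Apple's CIK, as in the example above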

Google Sheets xpath query not working

Hoping someone smarter than me can help me sort this out! I've been stumped for a few days now trying to pull some data from a website into a Google Sheet using ImportXML, with no luck.
I'm looking to import the average odds for various sporting events from the website Oddsportal.com, which update and change throughout the day. I'd like my sheet to update these odds as well, similar to stock prices.
For example:
http://www.oddsportal.com/search/San+Jose+Sharks/
I would like to pull the average odds for Team "1" (+136), Tie "X" (+277) and Team "2" (+161) into individual cells - just the odds portion. If they can't be pulled from that page, they are also listed on http://www.oddsportal.com/hockey/usa/nhl/san-jose-sharks-nashville-predators-6cPaAHOM/ down at the bottom, in the Average Odds row.
This seems simple enough but I just can't seem to get the ImportXML query correct without an error.
I've looked at the page's source code (Ctrl-U). The original HTML does not contain the needed values; they are most likely loaded later through an XHR (AJAX) call, so you will probably not succeed with a plain HTML request.
You need to explore the Network tab in the browser DevTools to find out which request is initiated (by the JS files) to get the needed data. It might even be a unique request containing a hash signature, so you may not be able to reproduce it for future use.
I recommend you turn to scraping tools for retrieving that info.
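If you end up scripting it outside of Sheets, one common route is a browser-automation tool such as Selenium, which runs the page's JavaScript for you. This is only a rough sketch; the XPath for the average-odds row is a guess you would need to adjust after inspecting the page in DevTools.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # requires a matching chromedriver on your PATH
driver.get("http://www.oddsportal.com/hockey/usa/nhl/san-jose-sharks-nashville-predators-6cPaAHOM/")
time.sleep(5)  # crude wait for the JavaScript/XHR-loaded odds to render

# Guessed XPath: inspect the "Average" row in DevTools and adjust as needed
cells = driver.find_elements(By.XPATH, "//tr[contains(., 'Average')]/td")
print([c.text for c in cells])

driver.quit()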

Generating vector data (points) for OpenLayers Cluster

In my web application I am going to use the OpenLayers.Strategy.AnimatedCluster strategy, because I need to visualize a great number of point features. Here is a very good example of what it looks like. In both examples at that link, the data (point features) are generated or taken from a GeoJSON file.
So, can anybody provide me with a file containing 100,000+ (or better, 500,000+) features (world cities, for instance), or explain how I can generate them so that they are located all over the world (not concentrated in Spain, as in the first example at the above-mentioned link)?
Use a geolocation database to supply the data you need. GeoLite, for example.
If 400K+ locations is OK, download their CSV city list.
If you want more, you might want to give the Nominatim downloads a try, but they are quite bulky (more than 25 GB) and parsing the data is not as simple as with a CSV file.
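Once you have a CSV of that sort, converting it into a GeoJSON FeatureCollection for the cluster strategy is straightforward. This sketch assumes hypothetical column names (city, latitude, longitude), so match them to whatever the GeoLite file actually uses.
import csv
import json

features = []
with open("cities.csv", newline="", encoding="utf-8") as f:  # placeholder file name
    for row in csv.DictReader(f):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON coordinate order is [longitude, latitude]
                "coordinates": [float(row["longitude"]), float(row["latitude"])],
            },
            "properties": {"name": row["city"]},
        })

with open("cities.geojson", "w", encoding="utf-8") as out:
    json.dump({"type": "FeatureCollection", "features": features}, out)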

Programmatically find common European street names

I am in the middle of designing a web form for German and French users. Within this form, the users would have to type street names several times.
I want to minimize the annoyance to the user and offer an autocomplete feature based on common French and German street names.
Any idea where I can find a royalty-free list?
Would your users have to type the same street name multiple times? Because you could easily prevent this by coding something that prefills the fields.
Another option could be to use your user database as a resource. Query it for all the available street names entered by your existing users and use that to generate suggestions.
Of course this would only work if you have a considerable number of users.
[EDIT] You could have a look at OpenStreetMap with their Planet.osm dumps (or have a look here for a dump containing data for just Europe). That is basically the OSM database with all the map information they have, including street names. It's all in an XML format, and streets seem to be stored as ways. There are tools (e.g. Osmosis) to extract the data and put it into a database, or you could write something to plough through the data and filter out the street names for your database.
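If you go the do-it-yourself route instead of Osmosis, a rough sketch of pulling named ways out of an .osm XML extract with nothing but the Python standard library might look like this. The file name is a placeholder, and for very large extracts you would want the PBF format and a dedicated tool instead.
import xml.etree.ElementTree as ET

street_names = set()
# Stream the file so a large extract does not have to fit in memory all at once
for event, elem in ET.iterparse("europe-extract.osm", events=("end",)):
    if elem.tag == "way":
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        # Streets are ways carrying a highway tag; keep only the named ones
        if "highway" in tags and "name" in tags:
            street_names.add(tags["name"])
        elem.clear()  # free memory for elements we are done with

print(len(street_names), "unique street names")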
Start with http://en.wikipedia.org/wiki/Category:Streets_in_Germany and http://en.wikipedia.org/wiki/Category:Streets_in_France. You may want to verify the Wikipedia copyright isn't more protective than would be suitable for your needs.
Edit (merged from my own comment): Of course, to answer the "programmatically" part of your question: figure out how to spider and scrape those Wikipedia category pages. The polite thing to do would be to cache the results rather than hitting the pages every time you need the street list; refreshing once a month or so should be sufficient, since the information is unlikely to change significantly.
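Rather than scraping the rendered category pages, you could also hit the MediaWiki API, which returns the category members directly; a rough sketch follows. The category title is taken from the links above, continuation handling is omitted, and the User-Agent string is just a placeholder.
import requests

def category_members(category, limit=500):
    # Standard MediaWiki API query; continuation handling is omitted for brevity
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": category, "cmlimit": limit, "format": "json"}
    resp = requests.get(url, params=params,
                        headers={"User-Agent": "street-name-collector/0.1 (example)"})
    resp.raise_for_status()
    return [m["title"] for m in resp.json()["query"]["categorymembers"]]

print(category_members("Category:Streets_in_Germany"))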
You could start by pulling names via the Google API (just find, e.g., the lat/long outer bounds of Paris and go to the center), but since Google limits API use, it would probably take a very long time to do it.
I had once contacted City of Bratislava about the street names list and they sent it to me as XLS. Maybe you could try doing that for your preferred cities.
I like Tom van Enckevort's suggestion, but I would be a little more specific than just looking inside the Planet.osm links, because most of them require the use of some tool to deal with the supported formats (PBF, OSM XML, etc.).
In fact, take a look at the following link
http://download.gisgraphy.com/openstreetmap/
The files there are all in .txt format and if it's only the street names that you want to use, just extract the second field (name) and you are done.
As an FYI, I didn't have any use for the French files in my project, but mining the German files resulted (after normalization) in a little more than 380K unique entries (~6 MB in size).
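A minimal sketch of that extraction step is below; it assumes tab-separated lines, and the file names are placeholders, so check the actual delimiter and field order of the download before relying on it.
names = set()
with open("streets_germany.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        fields = line.rstrip("\n").split("\t")  # assuming tab-separated fields
        if len(fields) > 1 and fields[1]:
            names.add(fields[1])  # second field holds the street name

with open("street_names_de.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(names)))

print(len(names), "unique street names after de-duplication")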
#dusoft might be onto something - maybe someone at a government level can help? I don't think that a simple list of street names can be copyrighted, or that any royalties can be charged for one. If that is the case, maybe you could even scrape some mapping data from something like TomTom?
The "Deutsche Post" offers a list with all street names in Germany:
http://www.deutschepost.de/dpag?xmlFile=link1015590_3877
They don't mention the price, but I reckon it's not for free.
