get most recent file from azure blob storage - azure-databricks

My Azure blob storage has several files, like:
name          last modified
data_guid1    Jan 1, 2020
data_guid2    Jan 2, 2020
How would I grab the file with the most recent 'last modified' timestamp, like data_guid2?
Currently I hard-code the name:
file_location = "/dbfs/mnt/blob/container/data_guid1"
Thanks in advance.

You can get a list of all the file names, then write whatever custom code you want to find the most recent (i.e. find the one with the largest number at the end).
You can get this list using the dbutils.fs.ls("<path>") function: https://kb.databricks.com/data-sources/wasb-check-blob-types.html
The URL for your blob container will have the following format:
wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/
If you are having trouble with this approach, or if you would also like the "last modified" timestamps for the files, check out this link for additional ways to list the files in a blob directory: https://kb.databricks.com/data-sources/wasb-check-blob-types.html
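For illustration, a minimal Python sketch of that approach might look like the following. It assumes the container is mounted at /mnt/blob/container and that your Databricks runtime exposes a modificationTime field on the entries returned by dbutils.fs.ls; on older runtimes that field may be absent, in which case you could fall back to sorting by name.
# List the mounted container and keep only the data files
files = dbutils.fs.ls("/mnt/blob/container/")
data_files = [f for f in files if f.name.startswith("data")]

# Pick the entry with the newest modification time
# (modificationTime is a Unix timestamp in milliseconds on recent runtimes)
latest = max(data_files, key=lambda f: f.modificationTime)

# Rebuild the /dbfs-style path used in the question
file_location = "/dbfs" + latest.path.replace("dbfs:", "")
print(file_location)
If modificationTime is unavailable, sorting on f.name works as long as the name suffix sorts chronologically.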

Related

How to change Jekyll filename format

I'm using a collection called now that takes a date format and lists entries newest to oldest. I've determined that naming my files _now/YYYY-MM-DD-title.md (just like posts) will automatically assign dates to the collection, but it requires me to add a title. This means my files look like:
_now/
2019-02-28-now.md
2019-01-15-now.md
2019-01-01-now.md
...
I'd prefer if I didn't need to append -now to each file.
Does anyone know how to configure the filename format of a collection? I'm hoping for something in the configuration options, maybe like:
collections:
  now:
    filename: date
I'm essentially looking for the opposite of this: Jekyll Filename Without Date

Keras - using predefined training / validation split

I'm working with Tensorflow/Keras. I have two text files (train_{modality_name}.txt and val_{modality_name}.txt). They contain the split I want to use for the images I'm processing.
The format of these files is the following:
example_0_path category_id
example_1_path category_id
...
example_N_path category_id
and my folder structure is like this:
/labels
    train_X.txt
    val_X.txt
/data
    /modality_1
    ...
    /modality_M
(e.g. data/sketch/abbey/id)
How can I make use of the files?
'flow_from_dataframe' did the job; additionally, it was necessary to preprocess the txt files with pandas. This tutorial was very helpful: https://medium.com/@vijayabhaskar96/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1
I'm still having problems with matching the target size of the arrays (the labels seem to have the wrong format).
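For reference, a rough Python sketch of that flow_from_dataframe approach (not the asker's exact code; the column names, target size, and the assumption that the paths in the txt files are relative to the data/ directory are all illustrative):
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Read "example_path category_id" lines into a dataframe
train_df = pd.read_csv("labels/train_X.txt", sep=" ", header=None,
                       names=["filename", "class"])
train_df["class"] = train_df["class"].astype(str)  # string labels for class_mode="categorical"

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_dataframe(
    train_df,
    directory="data",          # assumed root for the relative paths in the txt file
    x_col="filename",
    y_col="class",
    target_size=(224, 224),    # must match the input shape of the model
    class_mode="categorical",  # one-hot labels; use "sparse" for integer labels
)
The label-format mismatch mentioned above is often a class_mode issue: "categorical" yields one-hot labels for categorical_crossentropy, while "sparse" yields integer labels for sparse_categorical_crossentropy.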

Automate downloading of multiple xml files from web service with power query

I want to download multiple xml files from web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
which I manipulate to get a list of file names, such as PUB_DailyMeterDataD1_201812041627.xml, in a column on an Excel spreadsheet.
I hoped to get a function to run against this list of names to get all the data, so first I worked on one file: PUB_DailyMeterDataD1_201812041627
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an xml table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all xml files available from the service. The function requires a variable to be substituted for the filename. I try this as preparation for the function:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Web.Contents("https://reports.sem-o.com/documents/Filename"),
(followed by the manipulating M code)
This doesn't work.
Then I tried this:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I'm stuck here. Can you help?
Thanks,
Conor
You append strings with the "&" symbol in Power Query. [Somename] is the format for referencing a field within a table; a normal variable is just referenced with its name. So in your example,
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds like you have an existing query that drills down to a list of filenames, though, and you are trying to use that to import them from the URL. Assuming the column you got the filenames from is called "Filename", you could add a custom column with this in it:
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
And it will load the table onto the row of each of the filenames.

How to fetch all records using NCBI Batch Entrez

I have over 200,000 accessions in a flat file, for which I need to retrieve the relevant entries from NCBI.
I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job. But encountered several problems:
The initial file was split into multiple sub-files, each containing 4000 lines. But it seems Batch Entrez has some size limitation on the returned file. For example: if the first 1000 accessions all have tens of thousands of lines, which reach the size limitation, then the remaining 3000 accessions will be rejected and won't be searched.
One possible solution in my head is to split the file into more sub-files and search individually. However, this requires too much manual effort.
So I am just wondering if there is any other solution, or any code could be used.
Thanks in advance
Your problem sounds like a good fit for a Bio-star toolkit. This is a solution using BioSmalltalk:
| giList gbReader |
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
gbReader
    genBankRecordsFrom: 'nuccore'
    format: #setModeXML
    uids: giList.
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
    collect: [: e | BioParser
        tokenizeNcbiXmlBlast: e contents
        nodes: #('GBAuthor' 'GBSeq_definition') ]
To execute/debug the script, just select it and a right-click will open the Smalltalk world-menu.
The API automatically splits and fetches your accession list (contained, in the script, in Batch_entrez_1.txt), maintaining the NCBI Entrez post limits to avoid penalties.
The result format is XML (which is an "easy" format for parsing or filtering specific fields), although it could be any of the retrieval modes supported by Entrez; for example, setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' with the database you want to query. Finally, choose the interesting fields; in the script I have chosen 'GBAuthor' and 'GBSeq_definition', but you are free to choose any of the available nodes.
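If Smalltalk is not an option, the same post-then-fetch-in-batches idea can be sketched with Biopython's Entrez module (shown purely as an illustrative alternative to the BioSmalltalk script above; the batch size, file names, and 'nuccore' database are assumptions to adapt, and for a list as large as 200,000 accessions you may still want to epost in chunks):
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact address

# One accession per line in the input file (assumption)
with open("Batch_entrez_1.txt") as handle:
    accessions = [line.strip() for line in handle if line.strip()]

# Post the whole list once; NCBI answers with a WebEnv/QueryKey pair to page through
post = Entrez.read(Entrez.epost("nuccore", id=",".join(accessions)))
webenv, query_key = post["WebEnv"], post["QueryKey"]

# Fetch the records back in modest batches to stay within the Entrez limits
batch_size = 500
for i, start in enumerate(range(0, len(accessions), batch_size)):
    fetch = Entrez.efetch(db="nuccore", rettype="gb", retmode="xml",
                          retstart=start, retmax=batch_size,
                          webenv=webenv, query_key=query_key)
    with open("batch_%04d.xml" % i, "w") as out:
        out.write(fetch.read())
    fetch.close()
Writing each batch to its own file avoids producing one concatenated, non-well-formed XML document.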

How to get the 'current observation data' from the NDFD (NOAA, NWS) REST service?

I'm trying to use the NDFD (National Digital Forecast Database) to get current temperature and relative humidity given a Lat and Long using their REST based service.
The issue at hand:
I can't match the 'current observation data' WITH the 'results' I get back from the REST-service.
The setup:
Location:
* Apple (1-infinite loop, Cupertino, California)
* Lat = 37.33; Lon = -122.03
If I issue the following REST-call:
http://www.weather.gov/forecasts/xml/sample_products/browser_interface/ndfdXMLclient.php?lat=37.33&lon=-122.03&product=time-series&begin=2009-06-21T17:12:35&end=2009-06-21T17:12:35&appt=appt&rh=rh&temp_r=temp_r&temp=temp
Note 1: I'm passing the begin and end time in UTC. They're the same because I'm looking for just a single point in time: the latest observed temp and relative humidity.
AND then compare it to the closest reporting station (San Jose International Airport, CA - KSJC - 37.37N 121.93W) at http://www.weather.gov/xml/current_obs/KSJC.xml
** I can never get them to MATCH. **
Note 2: The nearest reporting station is returned by the REST call as well, so I know I'm comparing Location apples to Location apples.
I've had two ideas:
1: I'm doing something wrong with how I'm passing in the begin/end times into the REST call...
2: You can't get 'current observed data' the way I'm trying to...
Lastly:
I've found a solution using outoftime's NOAA ruby lib [it parses an observation stations YAML file to find the nearest one given Lat/Lng, then goes directly to that station via its identifier, i.e. http://www.weather.gov/xml/current_obs/KSJC.xml]... but it just feels like I may be missing something obvious here, and I would like to use the REST-based interface ;)
Any help or pointers would be appreciated!
Thanks!
It looks like the service you are calling isn't for current data. Judging by the URL and the XML results it seems to be for forecasts. You can also put in future dates to get future forecast data. It expects the dates to be in the -0700 time zone according to the response. I'm not sure which service you should be calling to get the data you want though.
I know that this is an old question, but this is what I'm using to get current weather conditions: http://forecast.weather.gov/MapClick.php?lat=43.09110&lon=-79.0162&unit=0&lg=english&FcstType=dwml
Found this api/link yesterday. It's still developmental (operation-mode="developmental"):
http://forecast.weather.gov/MapClick.php?lat=37.33&lon=-122.03&FcstType=dwml
If you want the "current" observation, you use the XML here:
http://w1.weather.gov/xml/current_obs/seek.php?state=or&Find=Find
e.g.,:
http://w1.weather.gov/xml/current_obs/KAST.xml
If you click on the link you'll get a rendered page. However, if you pull from it using normal REST methods or just wget, it delivers an XML file.
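As a small illustration of pulling one of those current_obs station files with ordinary REST tooling, the Python sketch below fetches the KSJC feed from the question and reads a few fields; the element names (temp_f, relative_humidity, observation_time_rfc822) reflect the usual layout of that XML and should be treated as assumptions:
import urllib.request
import xml.etree.ElementTree as ET

# Fetch the station XML (some NWS endpoints reject requests without a User-Agent)
url = "http://w1.weather.gov/xml/current_obs/KSJC.xml"
req = urllib.request.Request(url, headers={"User-Agent": "example-script"})
with urllib.request.urlopen(req) as resp:
    root = ET.fromstring(resp.read())

# Pull the latest observed temperature and relative humidity
temp_f = root.findtext("temp_f")
rel_humidity = root.findtext("relative_humidity")
observed_at = root.findtext("observation_time_rfc822")
print("%s: %s F, %s%% RH" % (observed_at, temp_f, rel_humidity))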
