Automate downloading of multiple xml files from web service with power query - powerquery

I want to download multiple XML files from a web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
and manipulates it to get a list of file names, such as PUB_DailyMeterDataD1_201812041627.xml, in a column on an Excel spreadsheet.
I hoped to write a function to run against this list of names to get all the data, so first I worked on one file, PUB_DailyMeterDataD1_201812041627:
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an XML table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all XML files available from the service. The function requires a variable to be substituted for the filename. I tried this as preparation for the function:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = (Web.Contents("https://reports.sem-o.com/documents/Filename")),
(followed by the manipulating M code)
This doesn't work.
Then I tried this:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I'm stuck here. Can you help?
Thanks
Conor

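You concatenate strings with the "&" operator in Power Query. [Somename] is the syntax for referencing a field within a table; a normal variable is referenced just by its name. Neither is substituted inside a quoted string, so "https://reports.sem-o.com/documents/Filename" requests a document literally named Filename. So in your example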
let Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds, though, like you have an existing query that drills down to a list of filenames and you are trying to use that list to import the files from the URL. Assuming the column holding the filenames is called "Filename", you could add a custom column containing this:
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
This loads the table for each filename into that filename's row.
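The difference between a name written inside a quoted string and a variable concatenated onto it is not specific to Power Query. As a rough illustration only, here is the same idea sketched in Python; the second file name in the list is a made-up placeholder, not a real document on the service, and nothing is actually downloaded.

```python
# Base URL taken from the question above.
BASE_URL = "https://reports.sem-o.com/documents/"

def build_url(filename):
    """Concatenate the variable's value onto the base URL."""
    return BASE_URL + filename

# The mistake from the question: the variable name sits inside the quotes,
# so the URL ends with the literal text "Filename".
wrong_url = "https://reports.sem-o.com/documents/Filename"

filenames = [
    "PUB_DailyMeterDataD1_201812041627.xml",  # from the question
    "PUB_DailyMeterDataD1_HYPOTHETICAL.xml",  # hypothetical placeholder name
]

# One URL per file name, the equivalent of the custom column above.
urls = [build_url(name) for name in filenames]
```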

Related

How to separate tables contained in an excel file in different CSV?

I have an Excel file with multiple tables separated by blank rows, and I want to save each table in a separate CSV file with a script. How could I do it?
Thanks for the help.
UPDATE:
INPUT EXAMPLE:
Excel Example
OUTPUT EXAMPLE:
I want every one of those columns in a file like this one.
276.1722 54.318 50.6335
276.373 52.573 51.4047
277.0097 50.864 51.9912
277.9329 49.4127 52.8294
279.0832 47.9623 53.3041
280.3554 46.5477 53.5295
281.3679 44.9695 53.8862
282.4689 43.4235 54.1254
283.4763 41.8019 54.0885
284.5859 40.3595 53.5828
285.7263 38.941 52.988
286.8929 37.5684 52.3438
288.0729 36.2914 51.5373
289.0561 35.1335 50.4119
289.7246 34.2113 48.8901
290.0624 33.3207 47.2446
290.1395 32.2516 45.6541
290.0895 31.2818 44.0091
289.7804 30.5224 42.2812
289.211 29.8383 40.5862
Alternate way of storing the data:
You could instead store the tables in separate sheets of one Excel file and use one of these gems, depending on whether you're working with the old or the new Excel format:
'Roo', 'roo-xls', 'spreadsheet', 'write_xlsx'
Then loop through the sheets and apply the same logic to each, instead of placing the tables throughout a single sheet.
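If the tables do stay on one sheet, the splitting step itself is simple once the rows are in memory. The following is a minimal Python sketch of that step, not part of the answer above. It assumes the sheet has already been read into a list of rows by whatever spreadsheet library you use, and that a row whose cells are all empty marks the boundary between tables. The function names and the table_N.csv naming are hypothetical.

```python
import csv
import os
import tempfile

def split_on_blank_rows(rows):
    """Group consecutive non-blank rows into tables; blank rows separate them."""
    tables, current = [], []
    for row in rows:
        is_blank = all(cell is None or str(cell).strip() == "" for cell in row)
        if is_blank:
            if current:  # close the table collected so far
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:  # last table has no trailing blank row
        tables.append(current)
    return tables

def write_tables(tables, out_dir):
    """Write each table to its own CSV file and return the file paths."""
    paths = []
    for i, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"table_{i}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(table)
        paths.append(path)
    return paths

# Sample rows: two small tables separated by one blank row.
rows = [
    [276.1722, 54.318, 50.6335],
    [276.373, 52.573, 51.4047],
    [None, None, None],
    [277.0097, 50.864, 51.9912],
]
tables = split_on_blank_rows(rows)
paths = write_tables(tables, tempfile.mkdtemp())
```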

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we end up with tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, each tuple has a file path as the value of one of its fields. (We can obviously transform this into tuples having only one field with the file path as its value, or into a single tuple having only one field containing a delimiter-separated (say comma-separated) string of paths.)
Now I have to load these files in the Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file names from HDFS and concatenate them into a single string that you can then pass as a parameter to a Pig script for use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
So all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
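The glue script described above could look roughly like the following Python sketch. It only builds the glob string and the argument list; it assumes the paths have already been read from wherever the tuples were stored, and the function names are hypothetical. Passing the arguments as a list (for example to subprocess.run) avoids the shell, so the braces reach Pig unchanged.

```python
def build_glob(paths):
    """Join file paths into a brace glob such as {a,b}, skipping empty lines."""
    cleaned = [p.strip() for p in paths if p.strip()]
    return "{" + ",".join(cleaned) + "}"

def pig_command(script, paths):
    """Build the argument list for the pig call; no shell is involved."""
    return ["pig", "-f", script, "-param", "input=" + build_glob(paths)]

# Paths as they might come out of the stored tuples (one per line).
paths = [
    "hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000\n",
    "hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000\n",
    "\n",
]
cmd = pig_command("script.pig", paths)
# To launch Pig, one could then call subprocess.run(cmd, check=True).
```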
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.
MyInputFormat is a wrapper around the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (e.g. in my case from the file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's name from the Configuration's mapred.input.dir property), then update the same Configuration's mapred.input.dir with the retrieved file names, then return the result from the wrapped InputFormat (e.g. in my case TextInputFormat).
Note: 1. You cannot use the setLocation API from LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. You may wonder what happens if the LOAD statement executes before the STORE. This will not happen: if STORE and LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail, see the section "Store-load sequences" on the Pig Wiki.

How to load a file with a JSON array per line in Pig Latin

An existing script creates text files with an array of JSON objects per line, e.g.,
[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…
I would like to load this data in Pig, exploding the arrays and processing each individual object.
I have looked at using the JsonLoader in Twitter’s Elephant Bird to no avail. It doesn’t complain about the JSON, but I get “Successfully read 0 records” when running the following:
register '/tmp/elephant-bird/core/target/elephant-bird-core-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/usr/local/lib/json-simple-1.1.1.jar';
a = load '/path/to/file.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
dump a;
I have also tried loading the file as normal, treating each line as containing a single chararray column, and then trying to parse that as JSON, but I can’t find a pre-existing UDF which seems to do the trick.
Any ideas?
Like Donald said, you should use a UDF here. Here at Xplenty we wrote JsonStringToBag to complement ElephantBird's JsonStringToMap.
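For reference, the logic such a UDF has to implement is small: parse each line as a JSON array and emit one record per element. The Python sketch below shows only that logic on the sample lines from the question. It is not the JsonStringToBag implementation, and the function name is hypothetical.

```python
import json

def explode_json_lines(lines):
    """Yield every object from lines that each hold one JSON array."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        for obj in json.loads(line):
            yield obj

# The two sample lines from the question.
lines = [
    '[{"foo":1,"bar":2},{"foo":3,"bar":4}]',
    '[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]',
]
records = list(explode_json_lines(lines))
```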

Extract a specific node from an XML file

I want to extract only the Body node/tag from an XML file using doc.xpath in Ruby.
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts the text of every node.
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But this does not work.
Does anyone have any suggestions for how to extract exactly the Body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean, but let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
extracts every element named wcm:element that is a child of wcm:root, the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements named Body which are children of wcm:element.
What you need is to get the value of the wcm:element element whose name attribute is set to the value Body. You access attributes in XPath by prefixing them with an @ sign, and to express a "where" condition you use [...], a predicate. Your XPath statement needs to be:
/wcm:root/wcm:element[@name = 'Body']
I'm assuming that your XPath execution environment is fine with the namespace prefix (wcm), because you say that your first query returned content.
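To see the attribute predicate at work outside Ruby, here is a small Python sketch using the standard library's ElementTree, which supports this subset of XPath. The question does not show the namespace URI bound to the wcm prefix, so http://example.com/wcm below is a placeholder assumption, and the sample document is a shortened, made-up version of the one in the question.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real one is not given in the question.
NS = {"wcm": "http://example.com/wcm"}

XML = """<wcm:root xmlns:wcm="http://example.com/wcm">
  <wcm:element name="Title">Smoking study</wcm:element>
  <wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals.</p></wcm:element>
</wcm:root>"""

root = ET.fromstring(XML)

# Without a predicate, every wcm:element child of the root matches.
all_elements = root.findall("wcm:element", NS)

# With the predicate, only the element whose name attribute is Body matches.
body = root.find("wcm:element[@name='Body']", NS)

# Collect the text of the Body element and everything nested in it.
body_text = "".join(body.itertext()).strip()
```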

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass the imported dataset into another function? I tried the following, but it doesn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = fread(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So, you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change how the data is read, and the resulting variable name depends on the file name and on the user.
With DLMREAD you can only read numerical data from the file. You can also skip some number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in kind of Excel style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data; if the argument is not a string, you will get the error "Filename must be a string".
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.
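For comparison, the two steps above (joining the path and file name, then reading delimited numeric data while skipping the first row and column) look like this as a Python sketch. It is an analogue of fullfile and dlmread(file_input,'\t',1,1), not MATLAB code. The read_numeric helper and the sample file contents are hypothetical.

```python
import csv
import os
import tempfile

def read_numeric(path, delimiter="\t", skip_rows=0, skip_cols=0):
    """Read delimited numeric data, skipping leading rows and columns."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    return [[float(cell) for cell in row[skip_cols:]] for row in rows[skip_rows:]]

# Stand-ins for the values a file-selection dialog would return.
pathname = tempfile.mkdtemp()
file_input = "sample.txt"

# Combine folder and file name first, as recommended above.
full_path = os.path.join(pathname, file_input)

# Made-up sample file: a header row and an id column around the numbers.
with open(full_path, "w", newline="") as f:
    f.write("id\ta\tb\n1\t2.5\t3.5\n2\t4.0\t5.0\n")

# Skip the first row and the first column on the left.
M = read_numeric(full_path, "\t", skip_rows=1, skip_cols=1)
```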

Resources