Reading from multiple TFRecord files - tensorflow-datasets

I am using multiple TFRecord files and want to read from them to create datasets. I am trying to put the file paths into a dataset with from_tensor_slices and then use that dataset to read the TFRecords.
(Advantages of multiple tfRecords : https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards)
I want to know if there is an easier and proven method to do this.
file_names_dataset = tf.data.Dataset.from_tensor_slices(filenames_full)
def read(inp):
    return tf.data.TFRecordDataset(inp)
# flat_map (not map) flattens each per-file dataset into a single stream of records
file_content = file_names_dataset.flat_map(read)
My next step would be to parse the dataset using tf.io.parse_single_example for example.

The tf.data.TFRecordDataset constructor already accepts a list or a tensor of filenames. Hence, you can call it directly with your filenames: file_content = tf.data.TFRecordDataset(filenames_full)
From the tf.io.parse_single_example documentation:
One might see performance advantages by batching Example protos with parse_example instead of using this function directly.
Hence, I would recommend batching your dataset before mapping the tf.io.parse_example function over it:
dataset = (
    tf.data.TFRecordDataset(filenames_full)
    .batch(my_batch_size)
    .map(lambda batch: tf.io.parse_example(batch, my_features))
)
If you want a complete example, in this post I share my input pipeline (reading from many TFRecord files).
Kind regards, Alexis.

Related

Automate downloading of multiple xml files from web service with power query

I want to download multiple xml files from web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
and manipulates it to get list of file names such as: PUB_DailyMeterDataD1_201812041627.xml in a column on an excel spreadsheet.
I hoped to get a function to run against this list of names to get all the data, so first I worked on one file: PUB_DailyMeterDataD1_201812041627
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an xml table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all xml files available from the service. The function requires a variable to be substituted for the filename. I try this as preparation for the function:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = (Web.Contents("https://reports.sem-o.com/documents/Filename")),
(followed by the manipulating Mcode)
This doesn't work.
Then I tried this:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I'm stuck here. Can you help?
Thanks,
Conor
You append strings with the "&" symbol in Power Query. [Somename] is the format for referencing a field within a table; a normal variable is just referenced with its name. So in your example
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds like you have an existing query that drills down to a list of filenames and you are trying to use that to import them from the url though, so assuming that the column you have gotten the filenames from is called "Filename" then you could add a custom column with this in it
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
And it will load the table onto the row of each of the filenames.
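To automate this across the whole list, the same expression can also be wrapped as a query function and invoked from the custom column. A minimal sketch, where the function name fnGetMeterData is an assumption (save it as its own query with that name):

```m
// Hypothetical query function: takes a filename, returns its XML tables.
(Filename as text) as table =>
let
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename))
in
    Source
```

Adding a custom column containing fnGetMeterData([Filename]) then yields one nested table per row, which can be expanded as usual.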

Using Kiba: Is it possible to define and run two pipelines in the same file? Using an intermediate destination & a second source

My processing has a "condense" step before needing further processing:
Source: Raw event/analytics logs of various users.
Transform: Insert each row into a hash according to UserID.
Destination / Output: An in-memory hash like:
{
"user1" => [event, event,...],
"user2" => [event, event,...]
}
Now, I've got no need to store these user groups anywhere, I'd just like to carry on processing them. Is there a common pattern with Kiba for using an intermediate destination? E.g.
# First pass
source EventSource # 10,000 rows of single events
transform { |row| insert_into_user_hash(row) }
@users = Hash.new
destination UserDestination, users: @users
# Second pass
source UserSource, users: @users # 100 rows of grouped events, created in the previous step
transform { |row| analyse_user(row) }
I'm digging around the code and it appears that all transforms in a file are applied to the source, so I was wondering how other people have approached this, if at all. I could save to an intermediate store and run another ETL script, but was hoping for a cleaner way - we're planning lots of these "condense" steps.
To directly answer your question: you cannot define 2 pipelines inside the same Kiba file. You can have multiple sources or destinations, but the rows will all go through each transform, and through each destination too.
That said you have quite a few options before resorting to splitting into 2 pipelines, depending on your specific use case.
I'm going to email you to ask a few more detailed questions in private, in order to properly reply here later.
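Outside of Kiba itself, the "condense" step is just a group-by on the user ID. A plain-Ruby sketch of the two passes, where the event fields and the analysis (counting events per user) are illustrative placeholders:

```ruby
# First pass: condense raw events into a hash keyed by user ID.
events = [
  { user_id: "user1", action: "click" },
  { user_id: "user2", action: "view"  },
  { user_id: "user1", action: "buy"   },
]

users = Hash.new { |h, k| h[k] = [] }
events.each { |event| users[event[:user_id]] << event }

# Second pass: each grouped entry is now one user with all their events.
analysed = users.map do |user_id, user_events|
  { user_id: user_id, event_count: user_events.size }
end
```

In Kiba terms, the first loop would live in a destination that fills the shared hash, and the second in a source that emits one row per user, run as a separate pipeline.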

Can we store multiple objects in file?

I am already familiar with How can I save an object to a file?
But what if we have to store multiple objects (say hashes) to a file.
I tried appending YAML.dump(hash) to a file from various locations in my code. But the difficult part is reading it back. As a YAML dump can extend over many lines, do I have to parse the file myself? Also, this will only complicate the code. Is there a better way to achieve this?
PS: The same issue persists with Marshal.dump, so I prefer YAML as it's more human-readable.
YAML.dump creates a single YAML document. If you have several YAML documents together in a file then you have a YAML stream. So when you appended the results from several calls to YAML.dump together you would have had a stream.
If you try reading this back using YAML.load you will only get the first document. To get all the documents back you can use YAML.load_stream, which will give you an array with an entry for each of the documents.
An example:
f = File.open('data.yml', 'w')
YAML.dump({:foo => 'bar'}, f)
YAML.dump({:baz => 'qux'}, f)
f.close
After this data.yml will look like this, containing two separate documents:
---
:foo: bar
---
:baz: qux
You can now read it back like this:
all_docs = YAML.load_stream(File.open('data.yml'))
Which will give you an array like [{:foo=>"bar"}, {:baz=>"qux"}].
If you don’t want to load all the documents into an array in one go you can pass a block to load_stream and handle each document as it is parsed:
YAML.load_stream(File.open('data.yml')) do |doc|
# handle the doc here
end
You could manage to save multiple objects by creating a delimiter (something to mark that one object is finished and that you go to the next one). You could then process the file in two steps:
read the file, splitting it around each delimiter
use YAML to restore the hashes from each chunk
Now, this would be a bit cumbersome, as there is a much simpler solution. Let's say you have three hashes to save:
student = { first_name: "John"}
restaurant = { location: "21 Jump Street" }
order = { main_dish: "Happy Meal" }
You can simply put them in an array and then dump them:
objects = [student, restaurant, order]
dump = YAML.dump(objects)
You can restore your objects easily:
saved_objects = YAML.load(dump)
saved_student = saved_objects[0]
Depending on your objects' relationship, you may prefer to use a Hash to save them instead of an array (so that you can name them instead of depending on the order).
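That Hash variant, sketched as a full round trip (string keys are used here so the dump also loads back under Ruby's default safe YAML loading, which rejects symbols in recent versions):

```ruby
require 'yaml'

# Save the objects under names rather than array positions.
objects = {
  "student"    => { "first_name" => "John" },
  "restaurant" => { "location" => "21 Jump Street" },
  "order"      => { "main_dish" => "Happy Meal" },
}

dump = YAML.dump(objects)

# Restore by name instead of by position.
saved_objects = YAML.load(dump)
saved_student = saved_objects["student"]
```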

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, each tuple has a file path as the value of one of its fields. (We can obviously transform these into tuples with only a single file-path field, or into a single tuple whose one field holds a delimiter-separated (say, comma-separated) string of paths.)
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the nested foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script, to use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param input="{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}"
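The glob-building step can be done in a couple of lines of shell, assuming the paths have already been fetched into a local file (the name filenames.txt here is an assumption, as is having one path per line):

```shell
# Write the HDFS paths one per line, as the earlier step would have produced them.
printf '%s\n' \
  'hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000' \
  'hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000' > filenames.txt

# Join the lines with commas and wrap them in braces to form the Pig glob.
glob="{$(paste -s -d, filenames.txt)}"
echo "$glob"

# Then invoke Pig with it; quoting prevents the shell from brace-expanding it:
#   pig -f script.pig -param input="$glob"
```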
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's name from the Configuration's mapred.input.dir property), then update the same mapred.input.dir with the retrieved file names, then return the result from the wrapped InputFormat (e.g. in my case TextInputFormat).
Note: 1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. One doubt may arise: what if the LOAD statement executes before the STORE? This will not happen, because if STORE and LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail, you may read the section "Store-load sequences" on the Pig Wiki.

Pass data from workspace to a function

I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass this imported dataset into another function? I tried diz, but it couldn't pick up the data from the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = freed(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of Matlab, which suggests to me that you would find it valuable to read some of the on-line help and some of the documentation for Matlab. When you've done that you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails, RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So you either use uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and the user.
With DLMREAD you can only read numerical data from the file. You can also skip some number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in kind of Excel style. See the DLMREAD documentation for more details.
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data; if it's not a string, you will get "Filename must be a string". Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.
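Putting those suggestions together, a minimal sketch of the corrected script (the call to freed from the question is left out, since its purpose isn't shown; the delimiter and skip counts are assumptions about the file format):

```matlab
% Let the user pick a file, then build the full path before reading it.
[file_input, pathname] = uigetfile( ...
    {'*.txt', 'Text (*.txt)'; '*.xls', 'Excel (*.xls)'}, 'Select file');
file_input = fullfile(pathname, file_input);

% Read the numeric data with dlmread (tab-delimited, skipping the first
% row and the first column); dlmread expects the filename as a string.
M = dlmread(file_input, '\t', 1, 1);
```

The resulting matrix M can then be passed directly as an argument to the next function.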
