I'm working with TensorFlow/Keras. I have two text files (train_{modality_name}.txt and val_{modality_name}.txt) that contain the split I want to use for the images I'm processing.
The format of these files is the following:
example_0_path category_id
example_1_path category_id
...
example_N_path category_id
and my folder structure is like this:
/labels
train_X.txt
val_X.txt
/data
/modality_1
...
/modality_M
(e.g. data/sketch/abbey/id)
How can I make use of the files?
'flow_from_dataframe' did the job; additionally, it was necessary to preprocess the txt files with pandas. This tutorial was very helpful: https://medium.com/@vijayabhaskar96/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1
I'm still having problems matching the target size of the arrays (the labels seem to have the wrong format).
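For reference, here is a minimal sketch of that approach. The file name train_sketch.txt, the directory data/sketch, and the target size are assumptions, not from the original post; the key point is that reading category_id as a string is what makes class_mode='categorical' accept the labels:

import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The txt files are whitespace-separated "path category_id" pairs.
train_df = pd.read_csv("labels/train_sketch.txt", sep=" ",
                       names=["filename", "class"], dtype=str)

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_dataframe(
    train_df,
    directory="data/sketch",     # root the relative paths resolve against (assumption)
    x_col="filename",
    y_col="class",
    target_size=(224, 224),      # must match the model's expected input size
    class_mode="categorical",    # y_col values must be strings for this mode
    batch_size=32,
)

If the labels still come out with the wrong shape, check that class_mode matches the loss: categorical_crossentropy expects the one-hot vectors produced by class_mode='categorical'.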
I'm using a collection called now that takes a date format and lists entries newest to oldest. I've determined that naming my files _now/YYYY-MM-DD-title.md (just like posts) will automatically assign dates to the collection, but it requires me to add a title to each one. This means my files look like:
_now/
2019-02-28-now.md
2019-01-15-now.md
2019-01-01-now.md
...
I'd prefer if I didn't need to append -now to each file.
Does anyone know how to configure the filename format of a collection? I'm hoping for something in the configuration options, maybe like:
collections:
  now:
    filename: date
I'm essentially looking for the opposite of this: Jekyll Filename Without Date
I want to download multiple XML files from a web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
and manipulates it to get a list of file names, such as PUB_DailyMeterDataD1_201812041627.xml, in a column of an Excel spreadsheet.
I hoped to get a function to run against this list of names to get all the data, so first I worked on one file: PUB_DailyMeterDataD1_201812041627
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an XML table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all XML files available from the service. The function requires a variable to be substituted for the filename. I tried this as preparation for the function:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = (Web.Contents("https://reports.sem-o.com/documents/Filename")),
(followed by the manipulating M code)
This doesn't work.
Then I tried this:
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I'm stuck here. Can you help?
Thanks,
Conor
You append strings with the "&" symbol in Power Query. [Somename] is the format for referencing a field within a table; a normal variable is just referenced with its name. So in your example,
let
    Filename = "PUB_DailyMeterDataD1_201812041627.xml",
    Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds like you have an existing query that drills down to a list of filenames, though, and you want to use that list to import the files from the URL. Assuming the column you got the filenames from is called "Filename", you could add a custom column containing:
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
and it will load the corresponding table onto each filename's row.
I have an Excel file with multiple tables separated by blank rows, and I want to save each table to a separate CSV file with a script. How could I do it?
Thanks for the help
UPDATE:
INPUT EXAMPLE:
Excel Example
OUTPUT EXAMPLE:
I want every one of those columns in a file like this one:
276.1722 54.318 50.6335
276.373 52.573 51.4047
277.0097 50.864 51.9912
277.9329 49.4127 52.8294
279.0832 47.9623 53.3041
280.3554 46.5477 53.5295
281.3679 44.9695 53.8862
282.4689 43.4235 54.1254
283.4763 41.8019 54.0885
284.5859 40.3595 53.5828
285.7263 38.941 52.988
286.8929 37.5684 52.3438
288.0729 36.2914 51.5373
289.0561 35.1335 50.4119
289.7246 34.2113 48.8901
290.0624 33.3207 47.2446
290.1395 32.2516 45.6541
290.0895 31.2818 44.0091
289.7804 30.5224 42.2812
289.211 29.8383 40.5862
Alternate way of storing the data:
You could actually store them in an Excel file in different sheets, and use these gems depending on whether you're working with old or new Excel:
'Roo', 'roo-xls', 'spreadsheet', 'write_xlsx'
Loop through the sheets and perform the same logic, instead of placing the tables throughout a single sheet.
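If you'd rather script it directly against the original single-sheet layout, here is a minimal Python sketch using pandas (not the Ruby gems above; the input file name and the output naming are assumptions):

import pandas as pd

# Read the whole sheet with no header row; needs openpyxl for .xlsx files.
df = pd.read_excel("input.xlsx", header=None)

# A table boundary is a completely blank row.
blank = df.isnull().all(axis=1)
table_id = blank.cumsum()

# Write each block of non-blank rows to its own space-separated file,
# matching the sample output format above.
for i, block in df[~blank].groupby(table_id[~blank]):
    block.dropna(axis=1, how="all").to_csv(
        "table_%d.csv" % i, sep=" ", header=False, index=False)

The dropna call drops columns that are empty for a given block, so each output file contains only that table's columns.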
I have a set of input files (say 10) with specific names. I run a word-count job on all the files at once (the input path is the folder). I expect 10 output files with the same names as the input files, i.e. the counts for file1 should be stored in a separate output file named "file1", and so on for all the files.
There are two approaches you can take to achieve multiple outputs:
Use the MultipleOutputs class; refer to the API documentation (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html) and, for more information about how to implement it, http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is LazyOutputFormat; however, this is used in conjunction with MultipleOutputs. For more information about its implementation, refer to https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/ .
I feel that using LazyOutputFormat in conjunction with the MultipleOutputs class is the better approach.
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in the file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique to "file0.txt". Use the context to get the filename each time.
Override the default Partitioner to make sure that all the map output keys with prefix "0_", or "file0_", go to the first partition, all the keys with prefix "1_", or "file1_", go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily map output files back to input files by checking your Partitioner code (e.g., part-00000 will be partition 0's output). A streaming sketch of this approach follows below.
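For illustration, here is a minimal Hadoop Streaming version of that prefix idea in Python (a sketch, not the Java API the answers above assume; streaming exposes the current input file path through the mapreduce_map_input_file environment variable, and the script names are hypothetical):

#!/usr/bin/env python
# mapper.py: prefix every word with the basename of its input file
import os
import sys

fname = os.path.basename(os.environ.get("mapreduce_map_input_file", "unknown"))
for line in sys.stdin:
    for word in line.split():
        print("%s_%s\t1" % (fname, word))

#!/usr/bin/env python
# reducer.py: sum the counts; streaming delivers keys sorted and grouped
import sys

current, total = None, 0
for line in sys.stdin:
    key, count = line.rsplit("\t", 1)
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

Partitioning by the file prefix can then be delegated to something like KeyFieldBasedPartitioner with "_" as the key field separator, so that each input file's words land in the same reducer.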
I have over 200,000 accessions in a flat file, for which I need to retrieve the relevant entries from NCBI.
I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job, but I encountered several problems:
The initial file was split into multiple sub-files, each containing 4000 lines. But it seems Batch Entrez has a size limitation on the returned file. For example: if the first 1000 accessions all have tens of thousands of lines, which reach the size limitation, then the remaining 3000 accessions will be rejected and won't be searched.
One possible solution in my head is to split the file into more sub-files and search them individually. However, this requires too much manual effort.
So I am just wondering if there is any other solution, or any code that could be used.
Thanks in advance
Your problem sounds like a good fit for a Bio* toolkit. This is a solution using BioSmalltalk:
| giList gbReader |
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
gbReader
    genBankRecordsFrom: 'nuccore'
    format: #setModeXML
    uids: giList.
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
    collect: [: e | BioParser
        tokenizeNcbiXmlBlast: e contents
        nodes: #('GBAuthor' 'GBSeq_definition') ]
To execute/debug the script, just select it and a right-click will open the Smalltalk world-menu.
The API automatically splits and fetches your accession list (contained in Batch_entrez_1.txt in the script), respecting the NCBI Entrez post limits to avoid penalties.
The result format is XML (which is an "easy" format for parsing or filtering specific fields), although it could be any of the retrieval modes supported by Entrez; for example, setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' with the database you want to query. Finally, choose the interesting fields; in the script I have chosen 'GBAuthor' and 'GBSeq_definition', but you are free to choose any of the available nodes.
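If you would rather stay with Python, the same batching idea can be sketched with Biopython's Entrez module (the input file name comes from the question; the email address, batch size, and output naming are assumptions):

from Bio import Entrez

# NCBI requires a contact address; replace with your own.
Entrez.email = "you@example.org"

with open("Batch_entrez_1.txt") as fh:
    accessions = [line.strip() for line in fh if line.strip()]

# Small batches sidestep the size limit that truncated the Batch Entrez results.
batch_size = 200
for start in range(0, len(accessions), batch_size):
    batch = accessions[start:start + batch_size]
    handle = Entrez.efetch(db="nuccore", id=",".join(batch),
                           rettype="gb", retmode="xml")
    # One well-formed XML document per batch.
    with open("batch_%06d.xml" % start, "w") as out:
        out.write(handle.read())
    handle.close()

Biopython throttles its requests to NCBI's published rate limits on its own, so no manual sleep between batches is needed.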