pyarrow dataset partitioning by filenames converting filename to field/column name - parquet

Is there a way to use the filename in a dataset as the value of a column?
For example, if the directory has
file1.parquet
file2.parquet
file3.parquet
can loading that as a dataset produce a column with the values file1, file2, and file3?
Or does partitioning only work with directory names? It seems to only work with directory names; is that right?

Support for filename-based partitioning will be in Arrow 8.0.0, which will likely release later this month or in May 2022. See ARROW-14612. The same goes for being able to have a column with the filename, see ARROW-15281.

Related

Power Query – File names loaded from folder become column names, causing failure if new files are later loaded

Power Query sourcing multiple Excel files from a folder.
Files are monthly transactions. The month and year are part of the file names. When the next month comes, new files (in the same format of course, but with new file names) replace the previous ones in the source folder. Having the new file names causes the query to fail on refresh in the following way.
When the files are combined and displayed to begin the transformations, the file names constitute a column of data (named Source). One of my steps in transforming the data is to “use first row as headers”; at this point the first file name in that Source column becomes its column header.
The problem is that when files having new names replace the previous ones, that column name is no longer found, since the row promoted to be the column header is the name of a new file. PQ is looking for a column header having the original file name and doesn’t find it, so subsequent transformations using that column cause errors.
The error message is: “[Expression.Error] The column ‘[OriginalFileName]’ of the table wasn’t found.”
Basically, that original file name takes on a permanent role as a column name that is part of the query.
I successfully managed to get around the problem by manually renaming all the columns instead of promoting the first data row to be the column headers. Now files with new names are processed without complaint. But this solution is clunky and I would like to keep the step of promoting the first row to be the header.
Does anyone know how to overcome this problem?

Removing last 4 characters from multiple columns in RStudio

I am new to programming/coding and new to RStudio.
I am working with a dataset in RStudio, 'ethica_surveys'. Three columns within my dataset contain a date, time, and time zone, e.g. '2018-06-15 11:49:22 CST'. I want to remove the CST from each of these columns.
I first tried this:
str_sub(ethica_surveys$schedule_time,1,str_length(ethica_surveys$schedule_time)-4)
It worked, but only showed me the newly edited column in my console, my dataset did not change.
I then tried:
ethica_surveys <- str_sub(ethica_surveys$schedule_time,1,str_length(ethica_surveys$schedule_time)-4)
This changed the column in my dataset, but also seemed to erase all the other columns in the dataset.
I want to erase the CST (last 4 characters) in each of these three columns: schedule_time, issued_time, and response_time. I want this change to be reflected in my dataset, without erasing the other columns within the dataset. Can anyone advise as to how this could be done?
Thank you.
Assign the output of your transformation back to the column itself rather than to the whole data frame, and repeat for issued_time and response_time:
ethica_surveys$schedule_time <- str_sub(ethica_surveys$schedule_time,1,str_length(ethica_surveys$schedule_time)-4)

Insert image into a csv column ruby

I'm currently writing a crawler for a website, and my goal is to produce a CSV with a name in the first column and an image in the second, written by a Ruby script using the CSV.open method.
I have already used this method, but I don't know how to insert an image into a column, and I can't find any information about it.
Is it really possible? If not, what would you use to get a list of string + image pairs after crawling?
A CSV (Comma Separated Values) file is a TEXT file which, as the name implies, holds values separated by commas, expressed in plain ASCII or sometimes Unicode. It is intended as a lightweight way to transfer tabular data between different computer systems or programs. You can use it to dump a database table, or the VALUES in something like a spreadsheet. The normal convention is for the first row (line) of the file to contain names or labels that describe what each column contains, with the data in the subsequent rows.
As such, there really is no practical way to embed an image within a CSV file. This is not a limitation of Ruby or Watir, but a limitation of text files that spans pretty much all languages and operating systems.
To do what you want, you would be better off saving the images into a specific directory using unique filenames and inserting those filenames into the CSV file.
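The approach of saving images to a directory and listing their filenames in the CSV can be sketched as follows. The example is in Python rather than Ruby and is only an illustration: the function name save_images_and_index, the (name, image_bytes, extension) item shape, and the numbered filename scheme are all assumptions, not anything the crawler or the CSV format dictates.

```python
import csv
from pathlib import Path


def save_images_and_index(items, image_dir, csv_path):
    """Write each image to image_dir and record (name, image file name) in a CSV.

    `items` is an iterable of (name, image_bytes, extension) tuples; this shape
    is an assumption made for the sketch.
    """
    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "image_file"])  # header row
        for index, (name, data, extension) in enumerate(items):
            # A running counter keeps file names unique even if two names collide.
            file_name = f"{index:05d}{extension}"
            (image_dir / file_name).write_bytes(data)
            writer.writerow([name, file_name])
```

Whatever consumes the CSV afterwards can then join the image_file column with the image directory to locate each picture.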

rename windows files - reformat date

I have hundreds of files (pictures) that were loaded with the date as part of the name. However, the file names are currently in the format "MM-DD-YY xxxxxxxx.jpg". I would like to rename them to the format "YYYY-MM-DD xxxxxxxxx.jpg", so they can sort better.
Any ideas? I'm running Windows 10.
Thank you,
Luis
Select all the files with the same date that you want to rename.
E.g., select all pictures that were taken on 17-10-2016.
Hit F2 to rename.
Now enter the new date format (i.e. 2016-10-17 in our case).
After renaming, hit Enter.
All the selected files are renamed into the format you need.
Repeat this for each group of pictures that share a date.
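For hundreds of files, a small script may be less tedious than renaming by hand. The following is a minimal Python sketch, not a tested tool: the names new_name and rename_all are made up for illustration, it assumes every two-digit year belongs to the 2000s, and it only touches files whose names start with "MM-DD-YY " followed by the rest of the name.

```python
import re
from pathlib import Path

# Matches names such as "10-17-16 beach.jpg": MM-DD-YY, a space, then the rest.
PATTERN = re.compile(r"^(\d{2})-(\d{2})-(\d{2}) (.+)$")


def new_name(old_name, century="20"):
    """Return the 'YYYY-MM-DD rest' name, or None if old_name does not match.

    The century prefix is an assumption: two-digit years are treated as 20YY.
    """
    match = PATTERN.match(old_name)
    if match is None:
        return None
    month, day, year, rest = match.groups()
    return f"{century}{year}-{month}-{day} {rest}"


def rename_all(directory, dry_run=True):
    """Rename matching files in directory; with dry_run=True, only report the plan."""
    plan = []
    # sorted() takes a snapshot of the listing before anything is renamed.
    for path in sorted(Path(directory).iterdir()):
        target = new_name(path.name) if path.is_file() else None
        if target is None:
            continue
        plan.append((path.name, target))
        if not dry_run:
            path.rename(path.with_name(target))
    return plan
```

Calling rename_all on the picture folder with the default dry_run=True returns the list of planned (old, new) names without changing anything, which makes it possible to check the result before running it again with dry_run=False.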

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

I am building a transformation in Pentaho Data Integration, and I have a list of files in a directory on my SFTP server. These files are named in the format FILE_YYYYMMDDHHIISS.txt, so my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the last file in this list according to its creation date, so that I can pass it to another transformation step.
How can I do this in Pentaho Data Integration?
In fact this is quite simple because your file names can be sorted textually, and the max in the sort list will be your most recent file.
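The reason this works is that a fixed-width YYYYMMDDHHIISS timestamp sorts the same way as text and as a date. The idea can be checked outside of PDI with a few lines of Python; this is only an illustration of the sorting property, not a PDI step, and the function name latest_file and the strict name pattern are assumptions.

```python
import re

# Only names of the form FILE_<14 digits>.txt take part in the comparison.
NAME = re.compile(r"^FILE_\d{14}\.txt$")


def latest_file(names):
    """Return the name with the greatest timestamp, or None if nothing matches."""
    candidates = [name for name in names if NAME.match(name)]
    # Plain string comparison is enough because the timestamp is zero-padded.
    return max(candidates, default=None)


listing = [
    "FILE_20130701090000.txt",
    "FILE_20130703170000.txt",
    "FILE_20130702090000.txt",
]
print(latest_file(listing))  # prints FILE_20130703170000.txt
```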
Since a list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate. If you only have one column and you want to find the max in the entire set, you can add a grouping column with an Add Constants step, and configure it to add a column with, say, an integer 1 in every row.
Configure your Memory Group by to group on the column of 1s, and use the file name column as the subject. Then simply select the Maximum grouping type. This produces a single row containing the grouping column and the aggregate column with your max file name; the original file name field is removed.
