AutoLoader with a lot of empty parquet files - azure-databricks

I want to process some parquet files (with snappy compression) using AutoLoader in Databricks. A lot of those files are empty or contain just one record. Also, I cannot change how they are created, nor compact them.
Here are some of the approaches I tried so far:
I created a Python notebook in Databricks and tried using AutoLoader to load the data. When I run it for a single table/folder, it processes without a problem. However, when I call that notebook in a for loop over the other tables (for item in active_tables_metadata: dbutils.notebook.run("process_raw", 0, item)), I only get empty folders in the target.
I created a Databricks Workflows Job and called the same notebook for each table/folder (sending the name/path of the table/folder via a parameter). This way, every table/folder was processed.
I used DBX to package the Python scripts into a wheel and use it inside Databricks Workflows job tasks as entrypoints. This way I managed to create the same workflow as in point 2 above, but instead of calling a notebook I am calling a Python script (specified in the entrypoint of the task). Unfortunately, this way I only get empty folders in the target.
I copied all the functions used in the DBX Python wheel over to a Python notebook in Databricks and ran the notebook for one table/folder. Again, I only got an empty folder in the target.
I have set the following AutoLoader configurations:
"cloudFiles.tenantId"
"cloudFiles.clientId"
"cloudFiles.clientSecret"
"cloudFiles.resourceGroup"
"cloudFiles.subscriptionId"
"cloudFiles.format": "parquet"
"pathGlobFilter": "*.snappy"
"cloudFiles.useNotifications": True
"cloudFiles.includeExistingFiles": True
"cloudFiles.allowOverwrites": True
I use the following readStream configurations:
spark.readStream.format("cloudFiles")
.options(**CLOUDFILE_CONFIG)
.option("cloudFiles.format", "parquet")
.option("pathGlobFilter", "*.snappy")
.option("recursiveFileLookup", True)
.schema(schema)
.option("locale", "de-DE")
.option("dateFormat", "dd.MM.yyyy")
.option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
.load(<path-to-source>)
And the following writeStream configurations:
df.writeStream.format("delta")
.outputMode("append")
.option("checkpointLocation", <path_to_checkpoint>)
.queryName(<processed_table_name>)
.partitionBy(<partition-key>)
.option("mergeSchema", True)
.trigger(once=True)
.start(<path-to-target>)
My preferred solution would be to use DBX, but I don't know why the job succeeds while I only see empty folders in the target location. This is very strange behavior; I suspect AutoLoader times out after reading only empty files for some time!
P.S. The same also happens when I use plain Spark parquet streaming instead of AutoLoader.
Do you know of any reason why this is happening and how I can overcome this issue?

Are you specifying the schema of the streaming read? (Sorry, can't add comments yet)
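For what it's worth, a minimal sketch of passing an explicit schema, as the comment suggests checking; the column names and types here are purely hypothetical. With many empty files, schema inference has little to work with, so an explicit schema is generally the safer route:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("id", StringType(), True),              # hypothetical column
    StructField("created_at", TimestampType(), True),   # hypothetical column
])
This schema object is what gets passed to .schema(schema) in the readStream above.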

gensim w2v - additional file

I trained w2v on a rather big corpus (> 200 million sentences) and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was loaded successfully and read all the .npy files without any exceptions. The obtained model performed OK.
Now I retrained the model on a much bigger corpus (> 1 billion sentences). The same 3 files were automatically saved, as expected.
When I try to load my new retrained model:
w2v_model = Word2Vec.load(path_filename)
I get:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/...../w2v_US.model.trainables.vectors_lockf.npy'
But no .npy file with that suffix was saved by gensim at the end of the training
(I save all output files in the same directory, as required).
What should I do to obtain such a file as part of the output .npy files (maybe some option in gensim w2v when training)? Or are there other ways to overcome this issue?
If a .save() is creating any files with the word trainables in it, you're using an older version of Gensim. Any new training should definitely prefer using a current version. As of now (January 2022), that's gensim-4.1.2, released 2021-09.
If an attempt at a .load() generated that particular error, then that file should have been created, alongside the others you mention, when the .save() was done. (In fact, the only way the main file you named with path_filename could know that other filename is if that other file was written successfully, allowing the main file to complete writing.)
Are you sure that file wasn't written, but then somehow left behind, perhaps getting deleted or not moving alongside the other few files to some new filesystem path?
In general, I would suggest:
using latest Gensim for any new training
always enable Python logging at the INFO level, & watch the logging/console output of training/saving processes closely to see confirmation of expected activity/steps (a minimal setup is sketched after this list)
keep all files from a .save() that begin with the same main filename (in your examples above, w2v_US.model) together - & keep in mind that for larger models it may be a larger roster of files than for a small test model
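A minimal sketch of the logging suggestion; this is the standard Python logging setup commonly shown in the gensim docs:
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)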
You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:
save aside all files of any potential use
from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, & exact same training corpus – repeat all the model-building steps you did before up through the .build_vocab() step. (That is: no extra need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
use .save() to save that dummy model again - watching the logs/output for errors. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that away, rename it to be the file expected by the original model whose load failed, then leave it alongside that original model - and the .load() might then succeed, or fail in a different way.
(If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)
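Putting the steps above together, a hedged sketch of the dummy-model approach. The parameter names follow the Gensim 3.x API (implied by the trainables file naming); the corpus variable and all parameter values are placeholders that must match your original training run exactly:
from gensim.models import Word2Vec

# corpus, and every parameter below, must match the original run exactly (hypothetical values shown)
dummy = Word2Vec(size=300, window=5, min_count=5, workers=4)
dummy.build_vocab(corpus)   # builds the same vocabulary 'shape'; no .train() needed
dummy.save("dummy.model")   # should write dummy.model.trainables.vectors_lockf.npy alongside
Then copy that .npy file next to the broken model, rename it to the filename the failed .load() expected, and retry the load.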

Misinformation in DataStage XML Export

As the title suggests, I am just trying to do a simple export of a datastage job. The issue occurs when we export the XML and begin examination. For some reason, the wrong information is being pulled from the job and placed in the XML.
As an example the SQL in a transform of the job may be:
SELECT V1,V2,V3 FROM TABLE_1;
Whereas the XML for the same transform may produce:
SELECT V1,Y6,Y9 FROM TABLE_1,TABLE_2;
It makes no sense to me how the export of a job could be different from the actual architecture.
The parameters I am using to export are:
Exclude Read Only Items: No
Include Dependent Items: Yes
Include Source Code with Routines: Yes
Include Source Code with Job Executable: Yes
Include Source Content with Data Quality Specifications: No
What tool are you using to view the XML? Try using something less smart, such as Notepad or WordPad. This will help determine whether the problem is with your XML viewer.
You might also try exporting in DSX format and examining that output, to see whether the same symptoms are visible there.
Thank you all for the feedback. I realized that the issue wasn't necessarily with the XML. It had to do with numerous factors within our DataStage environment. As mentioned above, the data connections were old and unreliable. For some reason this does not impact our current production refresh, so it's a non-issue.
The other issue was the way that the generated-SQL and custom-SQL options work when creating the XML. In my case, there were times when old code was kept in the system, but the option was switched from custom code to SQL generated from the columns. This led to inconsistent output from my script, and thus the mini project was scrapped.

naming convention of part files in HDFS

When we do an INSERT INTO command in Hive, the result of the execution creates multiple part files in HDFS.
e.g. part-*-*****, or 000000_0, 000001_0, etc., or something else.
Is there a configuration/setting that controls the naming of these part files?
The cluster I work in creates 000000_0, 000001_0, 000000_1, etc. I would like to change this to part- or text-, etc., so that it's easier for me to pick these files up and merge them if needed.
If there is a setting that can be set in Hive right before executing the HQL, that would be ideal.
Thanks in advance.
I think you should be able to do it with:
set mapreduce.output.basename = part-;
This won't work. The only way I have found is with a custom file writer.
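If the goal is just to merge the part files afterwards, note that the standard getmerge command does not care about their names; a sketch with a hypothetical warehouse path:
hdfs dfs -getmerge /user/hive/warehouse/mydb.db/mytable /tmp/mytable_merged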

pig shell setup: automatically executing pig scripts

Is there a way to automatically run a pig script when invoking pig from command line?
The reason I'm wondering about this is that I have several import and define statements that I use constantly over and over to set everything up. Is it possible to define this collection of statements somewhere so that when I start pig, it will automatically execute those lines? I apologize in advance if this is something trivial that I missed from the documentation.
Yes, you can certainly do so from version 0.11 onwards.
You need to use the .pigbootup file.
Here is a nice blog post on setting up the pigbootup file:
http://hadoopified.wordpress.com/2013/02/06/pig-specify-a-default-script/
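For illustration, a hypothetical ~/.pigbootup could hold exactly the statements you keep retyping (all paths and names below are made up):
REGISTER /path/to/my-udfs.jar;
DEFINE myFunc com.example.udf.MyFunc();
IMPORT '/path/to/macros.pig';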
If you want to include Pig macros from a file, you can use the import command.
Take a look at http://pig.apache.org/docs/r0.9.1/cont.html#import-macros for reference.
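As a small, hypothetical example of the macro route: with a macros.pig containing
DEFINE filter_positive(rel, field) RETURNS out {
    $out = FILTER $rel BY $field > 0;
};
a script can then pull it in and call it:
IMPORT 'macros.pig';
data = LOAD 'input' AS (id:int, val:int);
good = filter_positive(data, val);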

How can I resolve MSI paths in VBScript?

I am looking to resolve the paths of files in an MSI, with or without installing it (whichever is faster), from outside the MSI, using VBScript.
I found a similar query using C# instead, and Christpher had provided a solution there: How can I resolve MSI paths in C#?
I am going through the very same pain now, but is there any way to achieve this using the WindowsInstaller object in VBScript, rather than endless queries back and forth through the SQL tables of the MSI? Any direction would be welcome, as I have tried and tested whatever I can, with very limited success.
Yes, there is a solution without installing the MSI, using VBScript.
There is a very good example in the Windows Installer SDK called "WiFilVer.vbs".
Using that example, I've thrown together a quick example script that does exactly what you need.
set installer = CreateObject("WindowsInstaller.Installer")
const READONLY = 0

' open the MSI database read-only, then create an installer session from it
set db = installer.OpenDataBase("<FULL PATH TO YOUR MSI>", READONLY)
set session = installer.OpenPackage(db, READONLY)

' run just the costing actions so the Directory table is resolved to real paths
session.DoAction("CostInitialize")
session.DoAction("CostFinalize")

' join File and Component to learn which directory each file lands in
set view = db.OpenView("SELECT File, Directory_, FileName, Component_, Component FROM File,Component WHERE Component=Component_ ORDER BY Directory_")
view.Execute

set record = view.Fetch
do until record is nothing
    file = record.StringData(1)
    directoryName = record.StringData(2)
    fileName = record.StringData(3)
    ' FileName may be in short|long format; keep the long name
    if instr(fileName, "|") then fileName = split(fileName, "|")(1)
    wsh.echo(session.TargetPath(directoryName) & fileName)
    set record = view.Fetch
loop
Just add the path to your MSI file.
Tell me if you need a more detailed answer; I will have some more time to answer this in detail this evening.
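Assuming you save the script as, say, resolvepaths.vbs (the name is hypothetical), you can run it with the console script host:
cscript //nologo resolvepaths.vbs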
EDIT: the promised background (and why I need to call CostFinalize)
naveen, actually MSDN was the only resource that can give a definitive answer on this, but you need to know where and how to look, since Windows Installer is IMHO a pretty complex topic.
I really recommend a mix of the MSDN installer function reference, the database reference, and the examples from the Windows Installer SDK (sorry, couldn't find a download link; I think it's somewhere hidden in the ~3 GB Windows SDK).
First you need general knowledge of MSIs:
An MSI is actually a relational database.
Everything is stored in tables that relate to each other.
(Actually not everything, but I will try to keep it simple ;))
This database is interpreted by the Windows Installer, which creates a 'Session'.
Also, some parts are resolved dynamically, depending on the system you install the MSI on,
like 'special' folders, similar to environment variables.
E.g. an MSI has a "ProgramFilesFolder" where Windows generally has %ProgramFiles%.
All the dynamic stuff exists only in the installer session, not in the database itself.
In your case there are 3 tables you need to look at; take care of the relations and resolve them.
The 'File' table contains all files, the 'Component' table tells you which file goes into which directory, and the 'Directory' table contains all information about the filesystem structure.
Using a SQL query I could link the Component and File tables to find the directory name (or primary key, in database jargon).
But the Directory table has relations within itself; it is structured like a tree.
Take a look at this example Directory table (taken from the InstEd MSI).
The columns are Directory, Directory_Parent and DefaultDir:

Directory                    Directory_Parent             DefaultDir
InstEdAllUseAppDat           InstEdAppData                InstEd
INSTALLDIR                   InstEdPF                     InstEd
CUBDIR                       INSTALLDIR                   hkyb3vcm|Validation
InstEdAppData                CommonAppDataFolder          instedit.com
CommonAppDataFolder          TARGETDIR                    .
TARGETDIR                    (null)                       SourceDir
InstEdPF                     ProgramFilesFolder           instedit.com
ProgramFilesFolder           TARGETDIR                    .
ProgramMenuFolder            TARGETDIR                    .
SendToFolder                 TARGETDIR                    .
WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2   TARGETDIR   Win
SystemFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2    WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2   System
The Directory_Parent links a directory to its parent directory; the DefaultDir column contains the actual name.
You could now resolve the tree yourself and replace all the special folders (which in VBScript would be very tedious)...
...or let the Windows Installer handle that (just like when installing an MSI).
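To make the tree concrete with the table above: INSTALLDIR (DefaultDir InstEd) has parent InstEdPF (DefaultDir instedit.com), whose parent is ProgramFilesFolder, whose parent is TARGETDIR; on a typical machine that chain resolves to something like C:\Program Files\instedit.com\InstEd\.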
Now I have to introduce a new thing: actions (and sequences).
When running (installing, removing, repairing) an MSI, a defined list of actions is performed.
Some actions just collect information; some change the actual database.
There are lists of actions (called sequences) for the various things an MSI can do,
like one sequence for installing (called InstallExecuteSequence), one for collecting information from the user (the UI of the MSI: InstallUISequence), and one for admin-point installations (AdminExecuteSequence).
In our case we don't want to run a whole sequence (which could alter the system or just take too long);
luckily, the Windows Installer lets us run single actions without running a whole sequence.
Reading the reference of the Directory table on MSDN (the remarks section), you can see which action you need:
Directory resolution is performed during the CostFinalize action
So, putting all this together, the script is easier to read:
* open the MSI file
* 'parse' it (providing the session)
* query the Component and File tables
* run the CostFinalize action to resolve the Directory table (without running the whole MSI)
* get the resolved path with the TargetPath function
BTW, I found the TargetPath function by browsing the Installer reference on MSDN.
Also, I just noticed that CostInitialize is not required; it's only needed if you want to get the source path of a file.
I hope this makes everything clearer. It's very hard to explain, since it took me like half a year to understand it myself ;)
And regarding PhilmE's answer:
Yes, there are more influences on the resolution of the Directory table, like custom actions.
Keeping that in mind, an administrative installation might also result in different directories (e.g. because a different sequence might hold different custom actions).
Components have conditions, so maybe a file is not installed at all.
I'm pretty sure InstEd doesn't take custom actions into account either.
So yes, there is no 100% solution; maybe a mix of everything is necessary.
The script given by weberik (derived from the MS SDK VB code) is very good, because it makes it easy to analyse the Directory table without writing your own algorithm (which is a mid-size effort, whether done in a loop or with recursion).
But it does not give a 100% perfect view for all files; see below.
The method of the script is semi-dynamic (it can be extended with other actions), but in effect it gives only the static directory structure, similar to a default administrative install or to advanced MSI viewers.
Normally this is enough and what we want.
But be aware that this is not a 100% solution (i.e. knowing exactly, in advance, the path of each file). That means it will not always give you the correct paths; it can fail in these cases:
You use command-line parameters which substitute predefined Directory table entries.
The MSI uses custom actions which change paths.
Especially, it is not guaranteed that every file is installed at all: there may be optional conditions and features, and this may depend on the install environment.
In effect, a 100% solution is very, very hard to achieve without really installing; it comes close to reprogramming nearly the whole Windows Installer engine. So simplifications are normally sufficient and accepted.
But you can extend the method to cover custom actions, e.g. by adding a line session.DoAction(..) for each additional action needed, or to include command-line parameters.
Sometimes it can be tricky. The simpler the structure of the MSI, the more likely you are to succeed without further effort.
Alternative to writing your own program:
The question is what you really want to find out, and whether it is really necessary to program it.
If you don't want to write an automated every-day MSI analyzer, maybe the following is sufficient for you:
First tip: install the MSI with msiexec /a mysetup.msi TARGETDIR="c:\mytestpath" (similar restrictions as the script above by weberik).
If the MSI does not use custom actions to change paths, including custom actions forgotten from the admin sequence ("forgetting" should be taken as the normal case for 99% of existing setups :-), you get the file structure as if you installed "really", with some special namings for the Windows predefined folders which you will figure out easily.
If the administrative install lacks some folders, it is often a better idea to fix the custom action (adding it to the admin sequence) and use this scenario as your primary test case.
The advantage is that you alone limit the dynamics used by the admin install. BTW, you can use the same command-line parameters or path-setting custom actions as in a real install.
Second tip: Google for the InstEd tool, go to the File or Component table, and you will see the resulting MSI paths in the same static way as with the mentioned VB script after calling CostInitialize/CostFinalize. For human viewing, such an editor view may be better.
For automatic testing and improved accuracy, you need your own program, of course.
For those of you who need that, the snippet given above is a good starting point. :-)
The rest of you should live easier with one of the two given methods, without programming.
