How do I get schema / column names from parquet file? - hadoop

I have a file stored in HDFS as part-m-00000.gz.parquet
I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but the file is compressed, so I ran gunzip part-m-00000.gz.parquet; that doesn't decompress it either, because gunzip doesn't recognise the .parquet extension.
How do I get the schema / column names for this file?

You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently from text files.
For exactly this purpose, the Parquet project provides parquet-tools, which lets you open a file and inspect its schema, data, metadata, etc.
Check out the parquet-tools project.
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case:
parquet-tools schema part-m-00000.parquet
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.

Parquet CLI:
parquet-cli is a lightweight alternative to parquet-tools.
pip install parquet-cli            # install via pip
parq filename.parquet              # view metadata
parq filename.parquet --schema     # view the schema
parq filename.parquet --head 10    # view the top N rows
This tool provides basic info about the parquet file.
UPDATE (Alternatives):
If you wish to do this using a GUI tool, then check out this answer - View Parquet data and metadata using DBeaver
DuckDB CLI
DuckDB has a CLI tool (prebuilt binaries for Linux, Windows, and macOS) that can be used to query parquet data from the command line.
PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
Connected to a transient in-memory database.
Read Parquet Schema.
D DESCRIBE SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
OR
D SELECT * FROM PARQUET_SCHEMA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌───────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name       │ column_type │ null │ key │ default │ extra │
├───────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ registration_dttm │ TIMESTAMP   │ YES  │     │         │       │
│ id                │ INTEGER     │ YES  │     │         │       │
│ first_name        │ VARCHAR     │ YES  │     │         │       │
│ salary            │ DOUBLE      │ YES  │     │         │       │
└───────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
More on DuckDB is described here.
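If you'd rather call DuckDB from Python instead of the CLI, the same query works through its Python package; here is a minimal sketch (the file name is just the sample file from above):
import duckdb

# open an in-memory database; nothing is persisted to disk
con = duckdb.connect()

# DESCRIBE returns one row per column: (column_name, column_type, null, key, default, extra)
rows = con.execute(
    "DESCRIBE SELECT * FROM read_parquet('userdata1.parquet')"
).fetchall()

for column_name, column_type, *_ in rows:
    print(column_name, column_type)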

If your Parquet files are located in HDFS or S3 like me, you can try something like the following:
HDFS
parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:8020/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet
S3
parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet
Hope it helps.

If you use Docker you can also run parquet-tools in a container:
docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools schema /tmp/file.parquet

Apache Arrow makes it easy to read Parquet metadata from many different languages, including C, C++, Rust, Go, Java, JavaScript, etc.
Here's how to get the schema with PyArrow (the Python Apache Arrow API):
import pyarrow.parquet as pq
table = pq.read_table(path)
table.schema # pa.schema([pa.field("movie", "string", False), pa.field("release_year", "int64", True)])
See here for more details about how to read metadata information from Parquet files with PyArrow.
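If you only need the schema and don't want to load any row data, PyArrow can also read just the file footer; a minimal sketch, reusing the path variable from above:
import pyarrow.parquet as pq

# reads only the Parquet metadata/footer, not the data pages
schema = pq.read_schema(path)
print(schema.names)  # just the column names
print(schema)        # full schema with types and nullability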
You can also grab the schema of a Parquet file with Spark.
val df = spark.read.parquet("some_dir/")
df.schema // returns a StructType
StructType objects look like this:
StructType(
  StructField(number,IntegerType,true),
  StructField(word,StringType,true)
)
From the StructType object, you can infer the column name, data type, and nullable property that's in the Parquet metadata. The Spark approach isn't as clean as the Arrow approach.
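If you are working from Python rather than Scala, the same information is available through PySpark; a minimal sketch (directory name taken from the example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("some_dir/")

# each StructField carries the column name, data type and nullable flag
for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)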

You can also use a desktop application to view Parquet, as well as other binary-format data like ORC and Avro. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, etc.

If you are using R, the following wrapper around functions from the arrow library will work for you:
read_parquet_schema <- function(file, col_select = NULL, as_data_frame = TRUE,
                                props = ParquetArrowReaderProperties$create(), ...) {
  require(arrow)
  reader <- ParquetFileReader$create(file, props = props, ...)
  schema <- reader$GetSchema()
  names <- names(schema)
  return(names)
}
Example:
arrow::write_parquet(iris,"iris.parquet")
read_parquet_schema("iris.parquet")
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

Since it is not a text file, you cannot do a "-text" on it.
Even without parquet-tools installed, you can read it easily through Hive, provided you can map that file to a Hive table.
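For example, assuming such a table already exists over the Parquet data (the table name my_parquet_table below is hypothetical), the columns are one DESCRIBE away:
DESCRIBE my_parquet_table;
-- or, with storage details (location, SerDe, etc.) as well:
DESCRIBE FORMATTED my_parquet_table;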

Related

Ansible playbook which uses a role defined in a collection

This is an example of an Ansible playbook I am currently playing around with:
---
- hosts: all
  collections:
    - mynamespace.my_collection
  roles:
    - mynamespace.my_role1
    - mynamespace.my_role2
    - geerlingguy.repo-remi
The mynamespace.my_collection collection is a custom collection that contains a couple of roles, namely mynamespace.my_role1 and mynamespace.my_role2.
I have a requirements.yml file as follows:
---
collections:
  - name: git@github.com:mynamespace/my_collection.git
roles:
  - name: geerlingguy.repo-remi
    version: "2.0.1"
And I install the collection and roles as follows:
ansible-galaxy collection install -r /home/ansible/requirements.yml --force
ansible-galaxy role install -r /home/ansible/requirements.yml --force
Each time I attempt to run the playbook it fails with the following error:
ERROR! the role 'mynamespace.my_role1' was not found in mynamespace.my_collection:ansible.legacy:/home/ansible/roles:/home/ansible_roles:/home/ansible
The error appears to be in '/home/ansible/play.yml': line 42, column 7, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
roles:
- mynamespace.my_role1
^ here
For the avoidance of doubt, I have tried multiple ways of defining the roles in the playbook including mynamespace.my_collection.my_role1 (the fully qualified name of the role within the collection).
I suspect I've done something wrong or misunderstood how this should work. My understanding is that a collection can contain multiple roles and that, once the collection is installed, I should be able to call upon any of those roles from my playbook, but that doesn't seem to be working for me.
The error seems to suggest it is looking for the role inside the collection but not finding it.
The collection is installed to the path /home/ansible_collections/mynamespace/my_collection and within that directory is roles/my_role1 and roles/my_role2.
Maybe the structure of the roles inside the collection is wrong?
I'm using Ansible 2.10 on CentOS 8.
Thanks for any advice!
EDIT:
I just wanted to expand on something I alluded to earlier. I believe the docs say the fully qualified name should be used to reference the role in the collection within the playbook.
Unfortunately, this errors too:
ERROR! the role 'mynamespace.my_collection.my_role1' was not found in mynamespace.my_collection:ansible.legacy:/home/ansible/roles:/home/ansible_roles:/home/ansible
The error appears to be in '/home/ansible/play.yml': line 42, column 7, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
roles:
- mynamespace.my_collection.my_role1
^ here
I posted this as an issue over at the ansible/ansible repo and we did get to the bottom of this.
One small clue missing is the contents of my /etc/ansible/ansible.cfg file:
COLLECTIONS_PATHS(/etc/ansible/ansible.cfg) = ['/home/ansible_collections']
DEFAULT_ROLES_PATH(/etc/ansible/ansible.cfg) = ['/home/ansible_roles']
To quote contributor Sloane Hertel directly:
There's a bug/discrepancy between how ansible-galaxy and ansible handle the collections path. ansible-galaxy tries to be smart and adds ansible_collections to the path if it's not there already, but ansible always joins ansible_collections to the path - the playbook is trying to find the collection in /home/ansible_collections/ansible_collections/.
The solution, therefore, is to change my COLLECTIONS_PATHS value from /home/ansible_collections to /home.
From then on ansible-playbook will be searching for any roles in collections in the path /home/ansible_collections/mynamespace/roles instead of /home/ansible_collections/ansible_collections/mynamespace/roles.
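For reference, a minimal sketch of the relevant part of /etc/ansible/ansible.cfg after that change (ini key names as used by Ansible 2.10):
[defaults]
# ansible-playbook appends "ansible_collections" itself, so point at the parent directory
collections_paths = /home
roles_path = /home/ansible_roles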
I changed my directory structure slightly:
home/
├─ ansible_collections/
│  ├─ mynamespace/
│  │  ├─ my_collection/
│  │  │  ├─ roles/
│  │  │  │  ├─ mynamespace
│  │  │  │  │  ├─ my_role1/
│  │  │  │  │  │  ├─ meta/
│  │  │  │  │  │  ├─ tasks/
Which now means my playbook file looks like:
---
- hosts: all
  collections:
    - mynamespace.my_collection
  roles:
    - mynamespace.my_role1
    - mynamespace.my_role2
    - geerlingguy.repo-remi
And the roles are now found correctly.

How to get the accountId that is required to do a cloud build with nativescript?

Currently the following command:
tns cloud build android --accountId=<cannot find the account id assigned to our email>
returns this line: Invalid accountId index provided
How could you get the correct accountId for your account?
(note that we have already used tns login to login successfully)
'1' works for me.
--accountId=1
An interesting solution from pansila :)
Still, if you want to build with your account ID, you should log in with tns login and then use the following command to get the ID
tns account
It will spawn information similar to the following
┌───┬─────────────────────────────┬────────────────┬──────────┐
│ # │ Id                          │ Account        │ Type     │
├───┼─────────────────────────────┼────────────────┼──────────┤
│ 1 │ 000000-11111-2222222-555555 │ myaccount name │ personal │
└───┴─────────────────────────────┴────────────────┴──────────┘
So then you could run the cloud build with
tns cloud build android --accountId=000000-11111-2222222-555555

Golang cannot write file into directory

Situation:
I'm trying to write a file into a directory, like shown as follows:
func (p *Page) Save() error {
    filepath := DerivePath(p.Title)
    fmt.Println(filepath)
    content, _ := json.MarshalIndent(p, "", " ")
    err := ioutil.WriteFile(filepath, content, 0600)
    return err
}
Problem:
The following error occurs in line 5:
open data/Testpage.json: The system cannot find the path specified.
I already tried to create the file before writing with os.Create, but it doesn't work either.
Loading from the data directory works just fine. Only writing new files into the directory fails.
Additional information:
My project structure is as follows:
│   .gitignore
│   .project
│
├───bin
│       main.exe
│
├───data
│       Welcome.json
│
├───pkg
│   └───windows_amd64
│           page.a
│
├───src
│   ├───main
│   │       main.go
│   │
│   └───page
│           page.go
│           page_test.go
│
└───templates
        view.html
As mentioned above, reading data/Welcome.json works just fine (by using io/ioutils.ReadFile).
The source is available on https://gitlab.com/thyaris/Wiki.
Executing D:\GitWorkspaces\Wiki\wiki>go test -v page writes the following output:
=== RUN TestSave
data/Testpage.json
--- FAIL: TestSave (0.00s)
page_test.go:15: open data/Testpage.json: The system cannot find the path specified.
page_test.go:19: 'Testpage.json' was not created
=== RUN TestLoadPage
--- FAIL: TestLoadPage (0.00s)
page_test.go:26: Error while loading
page_test.go:32: File content did not match
=== RUN TestDelete
--- PASS: TestDelete (0.00s)
FAIL
exit status 1
FAIL page 0.094s
Your problem here is that the test engine is not running your executable with the working directory you expect. Instead of using the working directory defined by your shell or IDE, it is setting it to the source directory of the code being tested. (I had this bite me too once, long ago :) I had almost forgotten about that...)
The simple solution is to change DerivePath so that you can set the prefix externally, then set it to the path you need at the beginning of your tests. There are other (possibly better?) solutions, of course.

Is Log a compressed table engine in Clickhouse

I have a Log table and also a MergeTree table. In the system.columns table, it has a column, data_compressed_bytes, showing bytes compressed for each column for each table. I can see that the MergeTree table showing values under the column but for the Log table, the column shows all zeros.
Log
┌─database─┬─table─┬─name───────┬─type─────┬─data_compressed_bytes─┬─data_uncompressed_bytes─┬─marks_bytes─┐
│ default  │ logs  │ log_time   │ DateTime │                     0 │                       0 │           0 │
│ default  │ logs  │ start_time │ DateTime │                     0 │                       0 │           0 │
MergeTree
┌─database─┬─table──┬─name─────┬─type─────┬─data_compressed_bytes─┬─data_uncompressed_bytes─┬─marks_bytes─┐
│ default  │ logs_m │ log_date │ Date     │               1221802 │                 20000000 │       19536 │
│ default  │ logs_m │ log_time │ DateTime │              25181624 │                 40000000 │       19536 │
So, I am wondering if it means that columns in engine type Log are actually compressed or not.
The ClickHouse documentation states that TinyLog is compressed, but I'm not sure about Log, and I don't see that information in the system.columns table.
The Log engine compresses column data, just as TinyLog does.
Quotes from the doc:
TinyLog: The simplest table engine, which stores data on a disk. Each column is stored in a separate compressed file.
Log: Differs from TinyLog in that a small file of "marks" resides with the column files.
The information about the compressed and uncompressed sizes of a column is not reflected in the system.columns table because Log is quite a simple engine (unlike MergeTree) and doesn't store much meta-information about its own column files (it only maintains a sizes.json file with compressed column sizes).
So it would be possible to populate system.columns.data_compressed_bytes for Log's columns, but system.columns.data_uncompressed_bytes would still be zero, which might look questionable.
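One way to convince yourself, assuming a default ClickHouse install (data directory /var/lib/clickhouse) and the logs table from the question, is to look at the Log table's files on disk:
# paths assume the default data directory; adjust if your config differs
ls -lh /var/lib/clickhouse/data/default/logs/
# sizes.json holds the compressed sizes the engine tracks for its column files
cat /var/lib/clickhouse/data/default/logs/sizes.json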

Setting config when using 'make menuconfig'

I created a new config option in my Kconfig, like this:
config VIDEO_MY_DRIVER
    bool "my driver"
    default y
    depends on VIDEO_DEV && VIDEO_V4L2
    select V4L2_MEM2MEM_DEV
    ---help---
      This is my driver
When I run 'make menuconfig' and search for 'CONFIG_VIDEO_MY_DRIVER', I see it:
Symbol: VIDEO_MY_DRIVER [=n]
Type  : boolean
Prompt: my driver
  Location:
    -> Device Drivers
(1)   -> Multimedia support (MEDIA_SUPPORT [=y])
  Defined at drivers/media/platform/mydriver/Kconfig:5
  Depends on: MEDIA_SUPPORT [=y] && VIDEO_DEV [=n] && VIDEO_V4L2 [=n]
  Selects: V4L2_MEM2MEM_DEV [=n]
But when I want to set it and go to 'Device Drivers' -> 'Multimedia Support', I don't find the option there.
Can you please tell me if I made a mistake in my Kconfig, or where I should look when trying to set it under 'Device Drivers'?
This link may help you get some info.
For that option to appear, first check whether your module's dependencies are enabled; in your case these are VIDEO_DEV and VIDEO_V4L2. In your search output they are still (=n), i.e. not yet enabled in your kernel configuration.
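As a sketch, once those dependencies are enabled (via their entries under Multimedia support), the resulting .config fragment you are after would look something like this; exact menu locations depend on your kernel version:
CONFIG_MEDIA_SUPPORT=y
CONFIG_VIDEO_DEV=y
CONFIG_VIDEO_V4L2=y
CONFIG_VIDEO_MY_DRIVER=y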
