Can I generate an artificial rosbag (pointcloud)? - point-clouds

I have PLY data as XYZRGB plus a timestamp. Can I generate a rosbag file from it?
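A minimal sketch of one way to do this with the ROS1 Python API (rosbag plus sensor_msgs), assuming the PLY file has already been parsed into (x, y, z, r, g, b) tuples with a timestamp; read_ply() and the /points topic below are placeholders, not existing library calls:

import struct
import rosbag
import rospy
from std_msgs.msg import Header
from sensor_msgs.msg import PointField
import sensor_msgs.point_cloud2 as pc2

def xyzrgb_to_cloud(points, stamp, frame_id="map"):
    # points: iterable of (x, y, z, r, g, b); returns a sensor_msgs/PointCloud2
    header = Header(stamp=stamp, frame_id=frame_id)
    fields = [PointField("x", 0, PointField.FLOAT32, 1),
              PointField("y", 4, PointField.FLOAT32, 1),
              PointField("z", 8, PointField.FLOAT32, 1),
              PointField("rgb", 12, PointField.UINT32, 1)]
    # pack r, g, b into the usual 0x00RRGGBB unsigned int
    data = [(x, y, z,
             struct.unpack("I", struct.pack("BBBB", int(b), int(g), int(r), 0))[0])
            for x, y, z, r, g, b in points]
    return pc2.create_cloud(header, fields, data)

with rosbag.Bag("artificial.bag", "w") as bag:
    for points, t in read_ply("cloud.ply"):   # placeholder PLY loader
        stamp = rospy.Time.from_sec(t)
        bag.write("/points", xyzrgb_to_cloud(points, stamp), stamp)

The resulting bag can then be inspected with rosbag info and replayed with rosbag play.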

Related

How to read two-dimensional complex data with pyspark?

1) I have data in the format shown in the example below; it is read with np.fromfile(path_file, dtype=np.complex64).
2) The files are named with a .raw suffix; I previously wrote them out from Python in binary mode ('wb').
I am putting the data into HDFS. How can I read it back as an RDD? I have tried reading it as binary, but what comes out is garbled. How can I read data in this format?
1) sc.binaryFiles: returns garbled bytes
2) sc.textFile: returns the content as strings
dataset = np.array([[-1+1j,-1+1j,-1+1j],[-2+2j,-2+2j,-2+2j],[-3+3j,-3+3j,-3+3j],[-4+4j,-4+4j,-4+4j]])
Since the full dataset is too large to include, I approximate it here with a 4×3 matrix created with np.array.
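A minimal sketch of one way to read these files back, assuming each .raw file contains nothing but the raw bytes of a complex64 array: sc.binaryFiles returns (path, bytes) pairs, and the bytes can be decoded with NumPy rather than treated as text. The HDFS path and the reshape width of 3 (taken from the 4×3 example above) are assumptions:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="read-complex64")

# binaryFiles yields (file path, whole file content as bytes) pairs
raw = sc.binaryFiles("hdfs:///path/to/raw/*.raw")   # placeholder path

def decode(pair):
    path, content = pair
    # reinterpret the raw bytes exactly as np.fromfile(..., dtype=np.complex64) would
    arr = np.frombuffer(content, dtype=np.complex64).reshape(-1, 3)
    return path, arr

decoded = raw.map(decode)
print(decoded.first()[1][:2])   # first two rows of the first file

sc.textFile splits on newlines and decodes the bytes as text, so binary complex data will always look garbled there; sc.binaryFiles hands back the raw bytes, which only need to be reinterpreted with NumPy as above.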

Bash Script to save csv to 2 files

My script uses an API GET request to pull data and dump it to a CSV. Currently the script writes the CSV using today's date in the file name. I would also like it to produce a second, identical CSV that is overwritten on every run, so it always holds the latest data pull.
The part of my script that writes the first CSV is:
>> /mnt/d/DGD/"Market Place Scrapes"/$FILEDATE.csv
How do I create the second output?
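A minimal shell sketch, assuming the existing command stays as it is; the directory comes from the line above, and latest.csv is a placeholder name for the always-overwritten copy:

OUTDIR="/mnt/d/DGD/Market Place Scrapes"

# ...existing API call... >> "$OUTDIR/$FILEDATE.csv"

# overwrite (not append) a fixed-name copy with the newest pull
cp -f "$OUTDIR/$FILEDATE.csv" "$OUTDIR/latest.csv"

# or write both files in one pass:
# ...existing API call... | tee -a "$OUTDIR/$FILEDATE.csv" > "$OUTDIR/latest.csv"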

In NiFi, does converting data to an intermediary format (Avro) bring any benefits if the output format is not Avro?

I have a NiFi flow that reads text data from CSV files, does some processing, and then outputs the data to Kafka as JSON.
The processing step is mostly there to do data deduplication.
Here is a schema of the current flow:
Read CSV files ==> Processing (deduplication) ==> Convert to JSON ==> Push to Kafka
I'm thinking about converting the data to Avro after it has been read, and then doing the processing.
The proposed flow would be:
Read CSV files ==> Convert to Avro ==> Processing (deduplication) ==> Convert to JSON ==> Push to Kafka
Could this bring any benefits or is it just a useless intermediate step?
There would be no reason to convert to Avro. You can configure PublishKafkaRecord with a CSV reader and a JSON writer to go straight from CSV to JSON.
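For illustration only, a rough sketch of how that processor could be configured; the controller-service names, broker address, and topic are placeholders:

PublishKafkaRecord
    Record Reader : CSVReader (controller service)
    Record Writer : JsonRecordSetWriter (controller service)
    Kafka Brokers : broker1:9092 (placeholder)
    Topic Name    : my-topic (placeholder)

This also removes the separate Convert to JSON step, not just the proposed Avro conversion.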

Can I pull data directly from a Hive table into H2O?

We have our data stored in Hive as text files and Parquet files. Is there any way to load these directly into H2O, or do we have to go through an intermediate step like CSV or a pandas DataFrame?
Yes, you can find all the information you need here.
H2O currently supports the following file types:
CSV (delimited) files (including GZipped CSV)
ORC
SVMLight
ARFF
XLS
XLSX
Avro version 1.8.0 (without multifile parsing or column type modification)
Parquet
Notes:
ORC is available only if H2O is running as a Hadoop job.
Users can also import Hive files that are saved in ORC format.
When doing a parallel data import into a cluster:
If the data is an unzipped csv file, H2O can do offset reads, so each node in your cluster can be directly reading its part of the csv file in parallel.
If the data is zipped, H2O will have to read the whole file and unzip it before doing the parallel read.
So if you are reading very large data files from HDFS, it is best to use unzipped CSV. But if the data is farther away than the LAN, then zipped CSV is best.
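As a brief sketch of the direct route (the HDFS warehouse path below is a placeholder for wherever the Hive table's text or Parquet files actually live):

import h2o

h2o.init()   # or connect to an existing H2O-on-Hadoop cluster
# import the files backing the Hive table straight from HDFS
frame = h2o.import_file("hdfs://namenode/user/hive/warehouse/my_table/")
frame.describe()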

Analyzing a huge number of JSON files on S3

I have a huge number of JSON files stored on S3, more than 100 TB in total. Each file is about 10 GB bzipped, and each line contains a JSON object.
If I want to transform the JSON into CSV (also stored on S3) so I can import it into Redshift directly, is writing custom Hadoop code the only choice?
Would it also be possible to run ad hoc queries against the JSON files without transforming them into another format? (The source keeps growing, so I don't want to convert it every time I need to run a query.)
The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting. Using a JsonSerDe, you can easily transform the data into CSV format: it only requires inserting the data from the JSON-formatted table into a CSV-formatted table.
A good tutorial for handling the JsonSerde can be found here:
http://aws.amazon.com/articles/2855
A good library for the CSV format is:
https://github.com/ogrodnek/csv-serde
The EMR cluster can be short-lived and only needs to exist for that one job, and it can also run on low-cost spot instances.
Once you have the CSV format, the Redshift COPY documentation should suffice.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
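A rough HiveQL sketch of the approach described above; the table layout, columns, S3 paths, jar location, and SerDe class are placeholders (use whichever JSON SerDe the linked article recommends):

-- placeholder jar and SerDe class
ADD JAR s3://my-bucket/jars/json-serde.jar;

CREATE EXTERNAL TABLE events_json (id STRING, payload STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json-input/';

-- plain comma-delimited text table as the CSV target
CREATE EXTERNAL TABLE events_csv (id STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/csv-output/';

INSERT OVERWRITE TABLE events_csv
SELECT id, payload FROM events_json;

-- afterwards, run COPY ... FROM 's3://my-bucket/csv-output/' in Redshift as the docs above describe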
