How to COPY data from a Cassandra table to CSV with a WHERE clause? - cassandra-2.0

I need to get selective data copied to a CSV file from a Cassandra table, as the query result is about a million records. How do I do it with the COPY command?
select * from table1 where date='20190825'
Currently I am using the command below, which exports all of the table data:
COPY table1 TO '/tmp/25Aug.csv'
I need the data copied to the CSV file for only the selected date.

You don't use the CQL COPY command to extract specific rows to a CSV file.
See the COPY syntax: https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshCopy.html
Your options are:
DataStax Bulk Loader (dsbulk). See the unloading example at https://www.datastax.com/blog/datastax-bulk-loader-unloading and the project homepage https://github.com/datastax/dsbulk
Call cqlsh with the query statement and redirect the output to a file at the shell level
Use the cqlsh CAPTURE command to send query output to a file (a minimal sketch follows below)
These answers are just a summary from Export cassandra query result to a csv file
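For example, a minimal CAPTURE sketch, assuming table1 lives in a keyspace called ks and that date is the partition key (otherwise the query needs a secondary index or ALLOW FILTERING). Note that CAPTURE writes cqlsh's formatted query output to the file, not strict CSV:
CAPTURE '/tmp/25Aug.csv';
SELECT * FROM ks.table1 WHERE date = '20190825';
CAPTURE OFF;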

Related

No rows selected when trying to load csv file in hdfs to a hive table

I have a csv file called test.csv in hdfs. The file was placed there through filezilla. I am able to view the path as well as the contents of the file when I log in to Edge node through putty using the same account credentials that I used to place the file into hdfs. I then connect to Hive and try to create an external table specifying the location of my csv file in hdfs using the statement below:
CREATE EXTERNAL TABLE(col1 string, col2 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC LOCATION '/file path'
When I execute this command it creates an external table in Hive, but the table is empty, with only the columns showing up which I have already mentioned in the create statement. My question is: am I specifying the correct path in the LOCATION parameter in the create statement above? I tried using the path which I see in FileZilla when I placed my csv file into hdfs, which is in the format home/servername/username/directory/subdirectory/file
but this returns an error saying the user whose username is specified in the path above does not have ALL privileges on the file path.
NOTE: I checked the permissions on the file and the directory in which it resides and the user has all permissions(read,write and execute).
I then tried changing the path into the format user/username/directory/subdirectory/file and when I did this I was able to create the external table however the table is empty and does not load all the data in the csv file on which it was created.
I also tried the alternative method of creating an internal table as below and then using the LOAD DATA INPATH command. But this also failed as I am getting an error saying that "there are no files existing at the specified path".
CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;
LOAD DATA INPATH '/tmp/foobar.csv' INTO TABLE foobar;
First, you can't load a csv file directly into a Hive table that was created with the ORC file format. ORC is a columnar, compressed format for storing data in an optimised way. So you can load your data into an ORC-format table by following the steps below.
You should create a temp table in text file format.
Load data into it by using the command:
hive> LOAD DATA INPATH '...' INTO TABLE temptbl;
or else you can use the LOCATION parameter while creating the table itself.
Now create a Hive table with your required file format (RC, ORC, Parquet, etc).
Now load data into it by using the following command:
hive> INSERT OVERWRITE TABLE foobar SELECT * FROM temptbl;
You will get the table in ORC file format.
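Putting the steps together, a minimal sketch (the table names temptbl and orc_tbl, the column list, and the file path are placeholders for illustration):
-- 1. temp table in plain text format, matching the csv layout
CREATE TABLE temptbl (col1 string, col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 2. load the csv into the temp table (path is a placeholder)
LOAD DATA INPATH '/tmp/input.csv' INTO TABLE temptbl;
-- 3. target table stored as ORC
CREATE TABLE orc_tbl (col1 string, col2 string)
STORED AS ORC;
-- 4. copy the rows; Hive rewrites them as ORC files
INSERT OVERWRITE TABLE orc_tbl SELECT * FROM temptbl;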
On the second issue: if you load data into the table using the LOAD DATA command, the source file is moved, so it will no longer be at its original path; a new directory is created under the default warehouse location (/user/hive/warehouse/) with the table name, and the data is moved into that directory. Check that location and you will see the data.

Result of Hive unbase64() function is correct in the Hive table, but becomes wrong in the output file

There are two questions:
I use unbase64() to process data, and the output is completely correct in both Hive and SparkSQL. But in Presto the decoded value shows up incorrectly.
Then I insert the data to both a local path and HDFS, and the data in both output files is wrong.
The code I used to insert data:
insert overwrite directory '/tmp/ssss'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select * from tmp_ol.aaa;
My questions are:
1. Why can the processed data be shown correctly in both Hive and SparkSQL, but not in Presto? The Presto on my machine can display this kind of character.
2. Why can the data not be shown correctly in the output files? The files are in UTF-8 format.
You can try using CAST(... AS STRING) over the output of the unbase64() function:
spark.sql("""SELECT CAST(unbase64('UsImF1dGhvcml6ZWRSZXNvdXJjZXMiOlt7Im5h') AS STRING) AS `values`""").show(false)
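If the value needs to be decoded on the Presto side instead, Presto's built-in from_base64() and from_utf8() functions do the equivalent conversion. A minimal sketch, where col_b64 is a placeholder for the column holding the base64 text:
-- from_base64 returns varbinary; from_utf8 turns it into a UTF-8 string
SELECT from_utf8(from_base64(col_b64)) AS decoded
FROM tmp_ol.aaa;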

hive, get the data location using an one-liner

I wonder if there is a way to get the data location from hive using a one-liner. Something like
select d.location from ( describe formatted table_name partition ( .. ) ) as d;
My current solution is to get the full output and then parse it.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database, in most cases MySQL or Postgres. The metastore database details can be found in hive-site.xml. If you have access to the metastore database, you can run SELECT on the TBLS table to get the details about tables and on COLUMNS_V2 to get the details about columns; the storage location itself is kept in the SDS table (see the sketch below).
If you do not have access to the metastore, the only option is to describe each table to get the details. If you have a lot of databases and tables, you could write a shell script to get the list of tables using "show tables" and loop over the tables.
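For example, a sketch of such a metastore query, run against the metastore database itself rather than through Hive (the table and column names follow the common metastore schema, which can vary slightly between Hive versions; 'table_name' is a placeholder):
-- HDFS location of a table, straight from the metastore
SELECT d.NAME AS db_name, t.TBL_NAME, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID
WHERE t.TBL_NAME = 'table_name';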
Two methods if you do not have access to the metastore.
Parse the DESCRIBE TABLE output in the shell, as in this answer: https://stackoverflow.com/a/43804621/2700344
Also, Hive has a virtual column INPUT__FILE__NAME:
select INPUT__FILE__NAME from table_name
will output the location URL of each file.
You can split the URL by '/', take the element you need, aggregate, etc. (see the sketch below)
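For instance, a sketch that strips the file name to leave just the directory (regexp_extract is a standard Hive function; table_name is a placeholder):
-- one distinct parent directory per data file of the table
SELECT DISTINCT regexp_extract(INPUT__FILE__NAME, '^(.*)/[^/]+$', 1) AS data_location
FROM table_name;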

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualising the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode --Resourcemanager,Namenode,secondarynamenode,hive
8 datanodes--datanode,Nodemanager
I want to know the following things:
I got the data as .tar structured files, and the first 4 lines contain a description. I am a little bit confused about how to process this type of data.
1.a Can I directly process the data as these are tar files? If yes, how do I remove the data in the first four lines, or should I untar the files and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files.
Yes, see the below solution.
if yes, how to remove the data in the first four lines
Starting with Hive v0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), set while creating a table, that tells Hive the number of header rows to ignore. To ignore the first four lines, use tblproperties ("skip.header.line.count"="4") on the table the raw file is loaded into:
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the result locally.
Create the metadata (i.e. the table) in Hive based on the description.
E.g. if the description contains emp_id, emp_no, etc., then create the table in Hive using this information; also make a note of the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since the data is in a structured format, you can load it into the Hive table using the command below:
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE tablename;
Now the local data will be moved to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM tablename;

What is the best way to produce large results in Hive

I've been trying to run some Hive queries with largish result sets. My normal approach is to submit a job through the WebHCat API, and read the results from the resulting stdout file, or to just run hive at the console and pipe stdout to a file. However, with large results (more than one reducer used), the stdout is blank or truncated.
My current solution is to create a new table from the results (CREATE TABLE ... AS SELECT), which introduces an extra step and leaves a table to clean up afterwards if I don't want to keep the result set.
Does anyone have a better method for capturing all the results from such a Hive query?
You can write the data directly to a directory on either hdfs or the local file system, then do what you want with the files. For example, to generate CSV files:
INSERT OVERWRITE DIRECTORY '/hive/output/folder'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;
This is essentially the same as CREATE TABLE ... AS SELECT, but you don't have to clean up the table. Here's the full documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
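If the results should land on the local file system of the machine running the query rather than in HDFS, the LOCAL variant works the same way (the output directory here is just a placeholder):
-- writes comma-delimited part files into /tmp/hive_output on local disk
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT ... FROM ...;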
