ClickHouse - Multi-Storage Transformation

I am using a single-node ClickHouse installation with one 14 TB disk. The disk path is specified in config.xml.
Now I want to add a new disk and distribute the data between the disks.
For example:
the existing disk /DATA has a capacity of 14 TB and holds 12 TB of data. I want to add a new disk /DATA2 and move 6 TB of data from /DATA to /DATA2. At the end of the day the sizes should be /DATA 6 TB and /DATA2 6 TB, and when write requests come in, ClickHouse must write the data to the disks in a round-robin manner.
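One way to do this (a sketch, assuming a ClickHouse version with multi-disk MergeTree support; the disk names data1/data2 and the paths are placeholders) is to declare both disks in config.xml and define a storage policy whose single volume lists both disks - within a volume, ClickHouse writes new parts to its disks one by one, which gives the round-robin behaviour asked for:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <data1>
                <path>/DATA/clickhouse/</path>  <!-- the path must end with '/' -->
            </data1>
            <data2>
                <path>/DATA2/clickhouse/</path>
            </data2>
        </disks>
        <policies>
            <two_disks>
                <volumes>
                    <main>
                        <!-- new parts are written to these disks in turn -->
                        <disk>data1</disk>
                        <disk>data2</disk>
                    </main>
                </volumes>
            </two_disks>
        </policies>
    </storage_configuration>
</clickhouse>
```

A table must be created (or altered) with `SETTINGS storage_policy = 'two_disks'` to use the policy, and existing data can be rebalanced manually with `ALTER TABLE t MOVE PARTITION ... TO DISK 'data2'`. On older ClickHouse versions the root tag is `<yandex>` rather than `<clickhouse>`.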

Related

Why can't the metadata be stored in HDFS

Why can't the metadata be stored in HDFS with a replication factor of 3? Why is it stored on the local disk instead?
Because resource allocation by the NameNode would take more time due to the extra I/O operations involved. It is therefore better to keep the metadata in the NameNode's memory.
There are multiple reasons:
If it were stored on HDFS, every metadata access would involve network I/O, which would be slower.
The NameNode would depend on DataNodes for its own metadata.
Metadata about the metadata would also be required, so that the NameNode could identify where the metadata itself lives on HDFS - a chicken-and-egg problem.
Metadata is data about the data, such as which rack and DataNode a block is stored on, so that the block can be located. If the metadata were stored in HDFS and those DataNodes failed, you would lose all your data, because you would no longer know how to access the blocks where your data was stored.
Even if you kept a high replication factor, every change in the DataNodes would have to be written to the metadata replicas as well as to the NameNode's edit log.
With 3 replicas of the metadata, every change in a DataNode would first have to be written:
1. To its own replica blocks
2. To the NameNode and the metadata replicas (the edit log is edited 3 times)
This means writing more data than before, but storage is not the only or even the main problem; the main problem is the time required for all these operations.
This is why NameNode metadata is backed up to a remote disk, so that even if your whole cluster fails (the chances are small) you can still recover your data.
To protect against NameNode failure, Hadoop provides:
Primary NameNode -> holds the namespace image and edit logs.
Secondary NameNode -> periodically merges the namespace image with the edit logs so that the edit logs don't grow too large.

When to move data to HDFS/Hive?

So I'm developing an application that is expected to deal with large amounts of data, and as such I've decided to use Hadoop to process it.
My services node and datanodes are separated from the webapp, so I'm using HttpFS to communicate the app with Hadoop.
So, whenever a new row of data is generated in my application, should I immediately call the corresponding HttpFS URL to append the data to an HDFS file? Or should I write this data to a file on the web server and upload it to HDFS with a cron job, for example every hour?
Should I have the Hive table updated or should I just load the data in there whenever I need to query it?
I'm pretty new to Hadoop so any link that could help will also be useful.
I prefer the approach below.
Do not call the HttpFS URL to append data to an HDFS file for every row update. HDFS is efficient when data files are larger than the block size: 128 MB in Hadoop 2.x, 64 MB in Hadoop 1.x.
Write the data on the web server and use a rolling appender that rolls when the file reaches a certain size - a multiple of 128 MB, e.g. a 1 GB file.
You can use hourly cron jobs, but make sure you are sending a big data file (e.g. 1 GB, or a multiple of 128 MB) instead of just whatever log file has accumulated in one hour.
For loading the data, you can use internal (managed) or external Hive tables.
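The buffering idea above can be sketched as follows (a minimal illustration, not production code; the spool path, ROLL_SIZE and the host/path in the curl comment are assumptions):

```python
import os

ROLL_SIZE = 1024 ** 3  # upload at ~1 GB, a multiple of the 128 MB block size

def append_row(path, row):
    """Append one row to a local spool file; far cheaper than one HttpFS call per row."""
    with open(path, "a") as f:
        f.write(row.rstrip("\n") + "\n")

def should_roll(path, roll_size=ROLL_SIZE):
    """True once the spool file is big enough to be worth uploading to HDFS."""
    return os.path.exists(path) and os.path.getsize(path) >= roll_size

# When should_roll() is true, an hourly cron job (or the app itself) uploads the
# whole file in a single HttpFS CREATE call, e.g. (hypothetical host and path):
#   curl -X PUT -T spool.log \
#     "http://httpfs-host:14000/webhdfs/v1/logs/part-0001?op=CREATE&user.name=app&data=true" \
#     -H "Content-Type: application/octet-stream"
```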

Storing a file on Hadoop when not all of its replicas can be stored on the cluster

Can somebody let me know what will happen if my Hadoop cluster (replication factor = 3) is only left with 15GB of space and I try to save a file which is 6GB in size?
hdfs dfs -put 6gbfile.txt /some/path/on/hadoop
Will the put operation fail with an error (probably "cluster full"), or will it save two replicas of the 6 GB file, mark the blocks it cannot save as under-replicated, and thereby occupy the whole 15 GB of leftover space?
You should be able to store the file.
It will try and accommodate as many replicas as possible. When it fails to store all the replicas, it will throw a warning but not fail. As a result, you will land up with under-replicated blocks.
The warning that you would see is
WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas
Whenever you fire the put command:
The dfs utility behaves like a client here.
The client first contacts the NameNode; the NameNode tells the client where to write the blocks and maintains the metadata for that file. It is then the client's responsibility to split the data into blocks according to the configured block size.
The client then makes a direct connection to the different DataNodes where it has to write the different blocks, as instructed by the NameNode.
Only the first copy of the data is written by the client to the DataNodes; subsequent copies are created by the DataNodes on each other, guided by the NameNode.
So you should be able to put the 6 GB file if 15 GB of space is there, because initially the original copy is created on Hadoop; the problem only arises later, once the replication process starts.
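The arithmetic behind this answer can be made explicit (a toy sketch using the numbers from the question; real HDFS accounting is per block, not per whole file):

```python
def full_replicas_that_fit(file_gb, free_gb, replication=3):
    """How many complete copies of the file the remaining space can hold,
    capped at the requested replication factor."""
    return min(replication, free_gb // file_gb)

# 6 GB file, 15 GB free, replication factor 3: only 2 full copies fit,
# so the put succeeds but the blocks of the third copy stay under-replicated.
```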

Add an entire directory to the Hadoop file system (HDFS)

I have data that is stored within subdirectories and would like to put the parent directory in HDFS. The data is always in the last directory, and the directory structure extends up to 2 levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a long long time ! The sub directories are possibly 258X258. And this eventually fails with
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on devic
I can see the required space on the nodes. What am I doing wrong here ?
Also, the way I was planning to access my files was
hadoop jar Computation.jar input/*/* output
This worked well for small data set.
That error message is usually fundamentally correct. You may not be taking into account the replication factor for the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300GB of storage available to store a 100GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication) and your maximum replication (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip- and gzip-compressed files (though you need to be careful about splittability)
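For point 1, the hdfs-site.xml change would look roughly like this (a sketch; note that dfs.replication only affects files written after the change, while files already in HDFS can be adjusted with `hadoop fs -setrep`):

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.replication.max</name>
        <value>2</value>
    </property>
</configuration>
```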

Hadoop. About file creation in HDFS

I read that whenever the client needs to create a file in HDFS (The Hadoop Distributed File System), client's file must be of 64mb. Is that true? How can we load a file in HDFS which is less than 64 MB? Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
I read that whenever the client needs to create a file in HDFS (The Hadoop Distributed File System), client's file must be of 64mb.
Could you provide a reference for that? A file of any size can be put into HDFS. The file is split into 64 MB (default) blocks, which are saved on different DataNodes in the cluster.
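The splitting is simple arithmetic (a toy sketch; 64 MB is the old default block size, newer Hadoop versions default to 128 MB):

```python
import math

def num_blocks(file_mb, block_mb=64):
    """Number of HDFS blocks a file occupies. A block only takes as much
    space as the data in it, so small files are not padded to 64 MB."""
    return max(1, math.ceil(file_mb / block_mb))

# A 200 MB file becomes 4 blocks: three of 64 MB and one of 8 MB.
# A 10 MB file becomes a single 10 MB block.
```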
Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
It doesn't matter if a block or file is on a particular data node or on all the data nodes. Data nodes can fetch data from each other as long as they are part of a cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
I would suggest reading the HDFS documentation, especially the comic that explains HDFS.
