Under the Simple Coherency Model section of the HDFS Archiectiure guide, it states (emphasis mine):
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
I am confused by the use of "need not" here. Do they really mean "must not" or "should not"? If so, how can programs like HBase provide update support? If they really do mean "need not" (i.e. "doesn't have to"), what is trying to be conveyed? What file systems requires you to change a file once written?
Up to what I know, the need not is part of the assumption that "simplifies data coherency issues that enables high...". Actually means can't. But you can delete and create again the hole file.
After hadoop 0.20.2-append (like shown here) you can append data.
For all I read, I understand that HBase uses mainly memory (WAL? section 11.8.3) and modifications gets appended as marks. For example, to delete a column it makes a tombstone (see section 5.8.1.5) just marking the delete, and periodical compactation.
Maybe I am wrong. Good moment for me to learn the exact explanation :)
Related
Let's say I am sending log entries to ElasticSearch. We are considering adding the calling method, calling class, and line of code to our log entries. Being that these fields will contain similar values, would ElasticSearch attempt to preserve disk space by not copying this data for every occasion of the same value?
EDIT - Additional clarification: I did not read anywhere that Elastic does this. I know that some data storage systems, like columnar databases, write their data to disk so as to preserve disk storage by not writing duplicated data over and over again. So I am wondering if ElasticSearch implements similar techniques..
As far as I know: no, it doesn't. It would make several key features quite hard I believe, and I have not seen any reference to this practice.
It's tricky to 'proof' the non-existence of some method unless you look at all source-code, but I would expect this page about disk usage tuning to containt references to this practice.
Did you read anywhere about this, or does it just seem practical to you?
According to the documentation, short circuit reads are faster as they doesn't go through the data node. If this is the case then
Why isn't this enabled by default?
In which scenarios do we need short circuit reading?
Take a look at this article: http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
Summary of article:
One major downside to the original implementation is that it screwed with security implications. It had to give clients direct read access to the data files. I guess this was bad for kerberos enabled hdfs.
The new implementation passes a file descriptor instead, which supposedly is more secure and faster.
I guess there were some downsides to the old method. I don't see what the downsides to the new method are. I couldn't find a definitive answer in which version of Hadoop the new method appeared.
I see some variables named 'dirty' in some source code at work and some other code. What does it mean? What is a dirty flag?
Generally, dirty flags are used to indicate that some data has changed and needs to eventually be written to some external destination. It isn't written immediate because adjacent data may also get changed and writing bulk of data is generally more efficient than writing individual values.
There's a deeper issue here - rather than "What does 'dirty mean?" in the context of code, I think we should really be asking - is 'dirty' an appropriate term for what is generally intended.
'Dirty' is potentially confusing and misleading. It will suggest to many new programmers corrupt or erroneous form data. The work 'dirty' implies that something is wrong and that the data needs to be purged or removed. Something dirty is, after all undesirable, unclean and unpleasant.
If we mean 'the form has been touched' or 'the form has been amended but the changes haven't yet been written to the server', then why not 'touched' or 'writePending' rather than 'dirty'?
That I think, is a question the programming community needs to address.
Dirty could mean a number of things, you need to provide more context. But in a very general sense a "dirty flag" is used to indicate whether something has been touched / modified.
For instance, see usage of "dirty bit" in the context of memory management in the wiki for Page Table
"Dirty" is often used in the context of caching, from application-level caching to architectural caching.
In general, there're two kinds of caching mechanisms: (1) write through; and (2) write back. We use WT and WB for short.
WT means that the write is done synchronously both to the cache and to the backing store. (By saying the cache and the backing store, for example, they can stand for the main memory and the disk, respectively, in the context of databases).
In contrast, for WB, initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
The data is the dirty values. When implementing a WB cache, you can set dirty bits to indicate whether a cache block contains dirty value or not.
Systems have to sometimes accommodate the possibility of real world bad data. Consider that some data originates with paper forms. And forms inherently have a limited means of validating data.
Example 1: On one form users are expected to enter an integer distance (in miles) into a blank. We capture the information as written as a string since we don't always end up getting integer values.
Example 2: On another form we capture a code. That code should map to one of the codes in our system. However, sometimes the code written on the form is incorrect. We capture the code and allow it to exist with an invalid value until some future time of resolution. That is, we temporarily allow bad data since it's important to record the record even if some of it is invalid.
I'm interested in learning more about how systems accommodate bad data, that is, human error. Databases are supposed to be bastions of data integrity, but the real world is messy and people make mistakes. Systems must allow us to reflect those mistakes.
What are some ways systems you've developed accommodate human error? What practices have you used? What lessons have you learned?
Any further reading on the topic? (I had trouble Googling it.)
I agree with you, whatever we do there's no guarantee that we can get rid of bad or incorrect data. Especially, but not only, if it comes to user input. In my experience the same problems exist in complex integration projects, in which you have to integrate and merge (often inconsistent) data retrieved from different systems.
A good strategy is to decouple the input from the operational system itself. First, place user (or external system) provided data in a separate datastore (e.g. different schema). In a second step load this data into your operational datastore, but only if it confirms to strict rules (e.g. use address verification software to verify a given address). This Extract, Transform, Load (ETL) approach is fairly common in Data Warehousing (DWH) solutions, but can be applied programmatically in transactional systems as well (in my experience).
The above approach often leads to asynchronous processes in which the input is subitted first and (maybe) at a later time the external entity (user or system) retrives feedback whether its data was correct or not.
EDIT: For further readings I recommend to have a look at DWH concepts. Alhtough, you may not want to build such a thing, you could partially apply those concepts:
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Data_warehouse
http://en.wikipedia.org/wiki/Data_cleansing
A government department I worked in does a lot of surveys, most of which are (were) still paper based.
All the results were OCR'd into the system.
As part of the OCR process a digital scan of the forms is kept.
Data is then validated, data that is undecipherable or which fails validation is flagged.
When a human operator reviews the digital data they can modify the data if they are confident that they can correctly interpret what the code could not; they (here's the cool bit) can also bring up the scan of the paper based original, and use that to determine what the user was trying to say.
On a different thread; at some point you want to validate the data coming in against any expected data ranges that you want it to conform to; buy rejecting it at the point of entry you give the user a chance to correct it - the trade off is that every time you reject it you increase the chance of them abandoning the whole process.
At some point in your system you need to specify the rules which will be used for validation. At the end of the day a system is only going to be as smart as those rules. You can develop these yourself into the code (probably the business logic) or you might use a 3rd party component.
having flexible control over the validation is pretty important as they are likely to change overtime.
To be honest with you, one point of migrating from paper-based systems to IT is to remove these errors and make sure all data is always correct. I doubt any correctly planned and developed IT system (especially business financial systems) would allow such errors. Not in the company I am working for anyway...
There are lots of software tools that address the kinds of problems you mention. There are platforms and tools that let you define rules for scrubbing and transforming data and handling validation errors. Those techniques are widely used for Data Integration and Business Intelligence applications. Google for "Data Quality" or "Data Integration".
The easiest thing to do is to (this is not always possible) design the interface where users enter the data to limit as much as possible the amount of text that they need to enter. In my experience this seems to be where a lot of problems come from. One simple example of this is to provide a select, or auto-complete select field
One thing that you could do is do everything possible to determine if the data is correct before going into the db. I try to give the user entering the data as much feedback as possible so they can (ideally) fix some of the issues before the data gets persisted. For example, it is a very quick check to determine if the data being entered is of the correct type.
I got started in legal systems before the PC era. Litigation support databases routinely have to accommodate factually incorrect, incomplete, and contradictory information. It takes a different way of thinking.
The short version . . .
Instead of recording a single fact, you record multiple assertions about a fact. It boils down to designing a database to store data from assertions like these.
In an interview at 2011-01-03 08:13, Neil Rimes told Officer Cane
that he was at home from 2011-01-02 20:00 until 2011-01-03 08:13.
In an interview at 2011-01-03 08:25, Liza Nevers told Officer Cane
that Neil Rimes came home at 2011-01-02 23:45.
In a deposition at 2011-05-13 10:22, Cody Maxon told attorney Kurt
Schlagel that he saw Neil Rimes at Kroger at 2011-01-03 03:00
I'm building a CMS-type webapp that allows users to enter arbitrary-sized blocks of HTML. These blocks are entered by the user in their admin area and inserted into their template of choice when a page is delivered.
I'm guessing a user is not going to add more than 50-100 blocks and I'm not going to be getting more than 1000 users any time soon.
I was planning on using mySQL's LONGTEXT type to store these but I'm wondering if storing files in a directory will be more performant as the Linux OS will cache them? Given that I'm building for at most (1000 * 100) text blocks is there any reasonable performance worry with using mySQL?
Obviously I will be caching the HTML before delivery so I won't be reading these blocks on every delivery - reads will only occur when someone updates/creates new content.
I could use memcached/other cache/noSQL implementation or some other storage mechanism but I'm focusing on keeping it simple and delivering ASAP so don't want to introduce other stuff that I don't have experience with unless there's a significant performance worry.
Are the blocks of HTML content the only thing you are saving? If so, a file may be easiest.
However, it seems likely that you may want to save other bits of information along with the HTML and be able to query based on those bits of data. For example: date created, date last modified, name of the block, the user(s) who have edited the block.
If this is the case, then a database may be the best way to go. Since you said you do not expect to have many users (at least not a first) I would concentrate on finding the solution that is the fastest / most flexible to program and focus on performance and caching after your website begins to grow in size.
I advise you to use a flat file rather than Mysql to store this kind of data.
Html is more a "file" than a "value information" so it hasn't to be in a DB.
Moreover, you will certainly have better performances.
You can also read this post.