Subscribe by Email

Your email:

Current Articles | RSS Feed RSS Feed

Understanding Modern, Journalled Filesystems

Posted by Grant Mongardi on Fri, Nov 06, 2009 @ 10:54 AM
Share on Facebook Facebook | Submit to Digg digg it |  Add to delicious  delicious |  Submit to StumbleUpon StumbleUpon |  Share on LinkedIn LinkedIn |  Share On Technorati Technorati | Submit to Reddit reddit 
    By understanding how filesystems work, you can hopefully gain a better understanding of how you are and are not at risk. Most modern filesystems used for enterprise storage are what is called "journalled" (or also "transactional"). In this article, I'll try to explain the behaviors of these filesystems in order that you might better be able to assess your risk, and therefore be better able to recover in the event that the unthinkable happens - your data goes away. Please note, that I in no way am implying that the behavior described in this article in any way is universal to all filesystems, only that it is my understanding about most modern-day filesystem that I work with.
    No filesystem is risk-free. There will always be the potential for dataloss, however most enterprise-level filesystems do a good job of preserving the integrity of your files. The various pieces all work together to ensure that your data is there when you need it, regardless of the circumstances. However, the geniuses (and I mean that) that design these systems can't make your files immune to the possibility that we should all know could happen. The idea is to understand that risk, and make the best choices we can to protect ourselves in the event that it does. You may go your entire career without ever encountering this, but better prepared then wanting.
    All hierarchical (folder tree) filesystems contain these two components:
        Metadata
        Data Store
This includes non-journalled filesystems as well as journalled filesystems.
   The Metadata generally includes the superblock and the inode, or file table, which is just an index of the file names, where each file starts being stored on the disk, and where in the hierarchy it resides. It may also contain other infomation such as size, integrity references (checksums), or perhaps storage type (binary or text) as well. This information is used for storage and retrieval of these files, as well as information describing particulars about how this particular filesystem is implemented.
   The DataStore is simply the empty place where the actual file data is stored on disk. Most filesystems break these into blocks of storage of a pre-determined, fixed size. Fixed-size blocks tend to make retreival and writing much faster, as there's much less math for the filesystem drivers to perform in order to find the place where the file is stored and where it ends. There are some exceptions, and these exceptions are typically very optimized for the way they do variable-sized blocks, so any performance hit you take for "doing the math", you more than make up for in having a filesystem optimized for the type of files you're storing on it.
   To retrieve a file, the filesystem looks at the metadata table for the file it's trying to retreive, and then determines where on the filesystem the file starts, and goes and reads that piece. In the process of reading a single file, the filesystem drivers may find that the file is not stored in one contiguous (one after the other) group of blocks. When this happens, it's called 'fragmentation'. No performance filesystem that I know of can totally avoid having some level of fragmentation. As filesystem begin to fill up, with regular file creations and deletions, the filesystem needs to become creative with how it stores these file. Although the ideal situation is to store each file in a contiguous way, it's not always possible without having to rearrange a lot of other allocated blocks in order to get a group of blocks large enough. Trying to do this would make the filesystem incredibly slow whenever you tried to write a file to it, as it would need to perform all sorts of reorginizations any time it wrote a file to big for the available contiguous blocks. Modern disk has improved dramatically in performing these sorts of non-contiguous block reads/writes (known as 'random seeks'), so when a filesystem has a reasonable amount of free space (14% or more), the performance should remain acceptable.
   Now to the hard part - writing files. In order to write a file, the filesystem first determines the size of the file being written, and then it tries to find a group of blocks right next to each other to store the entire thing. Storing files in groups of blocks all next to each other (contiguous), means that the mechanical heads reading the physical underlying disk don't have to move much to get to the next block. For obvious reasons, this is much faster and more efficient. If it does find one, it writes the Metadata, telling the Metadata where it's storing the file, then it writes the actual file data to the Data Store. In the event it cannot find a single group of blocks to store the whole file, it will find the optimum blocks in which to distribute the file across. Different filesystems use different means to describe 'optimum', so I won't get into that here. Suffice it to say, some filesystems are better at it than others. As a filesystem begins to get filled up, the drivers have a much more difficult time in storing files, having to break the file up more, and as such having to perform more calculations to get it all right.

What happens when Murphy's Law strikes?
   Filesystems breaking, or filesystem corruption as it's known, is most often caused by underlying hardware issues such as the physical disk dying or 'going away', or perhaps a power outage (some filesystems are more fragile than others, and can break through regular use, but that's not typical of true enterprise filesystems). If the disk isn't mounted, or there are no writes happening at the time of the failure, the filesystem is very likely to be unharmed, and a simple repair should get it all back to normal. However most enterprise filesystems are in a constant state of flux, with files getting written, deleted, moved or modified nearly all of the time, so having such a failure is often a reason for concern.
   When this failure occurs, any operation that was happening at the time is truncated to the point of the failure. If it was in the middle of writing the file, it means that the entire file not only didn't make it to disk, but parts of the file and file record are likely incomplete, and there's no warning of that other than the broken parts. In a non-journalled filesystem, this means that the damage may not be noticed until someone finds that file that is partially written. When they try to open it, there's any number of possible scenarios. The worst possible scenario is that the storage chain is broken in a really bad way. For instance, the file may start on block number 9434, and then jump from there to block 10471, and then to block 33455, and then back to block 8167. If it died before it had a chance to tell the filesystem all of that, then the filesystem might think it goes from 10471 to 2211, simply because that's the value that was stored for that block prior (or perhaps it's just random data that was there to start with). If 2211 is somewhere within the superblock for your filesystem, and someone tries to resave that file, you've just completely broken your entire filesystem. Oooh, that hurts.
   This is where the jounalling comes in. With a jounalled filesystem, every operation is written down ahead of time in the journal, then the operation is performed, then it's deleted from the journal. In the event of a failure such as described in the last paragraph, when the filesystem comes back up, it will see that there are operations in the journal, and can "roll-back" those operations as if they never happened, hence the filesystem returns to it's last known good state. This ensure that each modification to the filesystem is 'atomic', or only happens when everything is absolutely complete. Even though some of those operations have been completed, the journal is housing all of the not yet complete operations currently being carried out, so when it returns it can just see what was being done and return to what it was before the partially complete operation. Although this isn't totally fool-proof, it certainly is miles ahead of the alternative. This reduces the possibility of a complete data loss failure only to events in which the journal is being written and becomes corrupted, and when the filesystem is also being written and becomes corrupted, and when the Metadata is also being written and becomes corrupted. The likelihood of all three of these things happening is reasonably unlikely where all three things get broken.

I live in a "reasonably unlikely" world. What do I do?
   Regardless of whether you're a lucky person or not, you should always have a good disaster recovery plan. Not only should you have one, you should schedule regular tests of that plan, or at the very least audit it. NAPC has years and years of experience in this area, and can discuss with you the risks associated with your particular configuration, as well as all of the possible scenarios for disaster recovery in the event of a catastrophic failure. To setup such a discussion, contact your Sales representative, or just send us an email!

2 Comments Click here to read/write comments

Staying Current on Support

Posted by Robert Sullivan on Tue, Oct 13, 2009 @ 10:28 AM
Share on Facebook Facebook | Submit to Digg digg it |  Add to delicious  delicious |  Submit to StumbleUpon StumbleUpon |  Share on LinkedIn LinkedIn |  Share On Technorati Technorati | Submit to Reddit reddit 

I ride dirt bikes and recently incurred a 700 dollar repair bill on my sons bike.
What? What was the cause of this? It's a two stroke motocross and the engine had
seized up, which is not all that unusual, but we had done a new top-end not all that
long ago. So why had it blown again so early?

The 'real mechanic' I brought it to explained that pump gas has changed drastically
from what the engine was designed to run on. But the bike is only four years old..!
Pump gas these days is going Greener with at lease 10% ethanol or more, and is so
oxygenated that it burns hotter, expanding the rings tighter around the piston...
you get the idea. Ring-ding-a-ding, goes to bwop-bwooooop... silence.

As back yard mechanics working on dirt bike engines we hadn't done anything wrong with
our top-end. The gas we run had changed and we didn't ever realize the consequences.

I had an issue last week that drove home the importance of staying current
with system wide support. My customer was having difficulty with XMP metadata
that was not showing up in Venture. This was for XMP fields that were working
correctly not that long ago. Venture syncs were also not showing the field values
either. The first instinctive question is "what changed?" And the answer is
"we're not doing anything differently." Like me with my dirt bike.

Xinet engineering had me activate fpod vlog per a specific tech note which will
create a special output file. Copy on a questionable file again and watch for it to
show up in the log. Verify that the XMP data is not showing in the browser and then
run the syncxmp command in debug mode to capture what the sync is actually doing,
or having issues with. Verify if the data still doesn't show up and send the logs and
sample file in to them for analysis.
They came back with a new syncxmp binary file to slip in, and this was the resolution
to the problem.

I inquired with Xinet engineering about the 'why' and 'how' questions having to do with
the metadata not showing up in Venture. Xinet replied that from their perspective,
the issue had to do with a non-compatible file: specifically,  the jpeg image that I
had sent, which caused syncxmp to fail with these errors:

syncxmp(87535) malloc: *** error for object 0x512060: Non-aligned pointer being freed (2)
syncxmp(87535) malloc: *** error for object 0x512390: double free

Now this error is gobbly-gouk to me. I would have thought a pointer being freed was
a good thing, but I'm not a code writing engineer for several reasons, which is why I
have a very defined escalation path.

The way they "fixed" this was to test my sample file against a newer build of syncxmp,
from the new Suite 16 code, which incorporates some newer XMP libraries provided by Adobe.
In short, the newer Adobe libraries resolved the issue.

So as far as my customer was concerned, they were "not doing anything differently"
But apparently Adobe was. The problem came from the fact that Adobe doesn't stand
still. Ever! They continue to evolve and improve their XMP libraries and those
changes were not recognized by the Xinet version my customer was running.
This is the intrinsic value of having support. We were able to update to a newer
binary to stay current with the ever changing world.

So even though you may not be doing anything different... Change Happens!
My dirt bike solution is to run race gas. The world keeps changing around us without our
consent or input and will not wait for us to adapt or catch up. The leading edge is really
not all that far ahead. But by falling behind, the distance becomes huge and costly.
So stay current with good support.
And if you ride dirt bikes, check your gas.

-Sully

1 Comments Click here to read/write comments

All Posts