Production success with Hadoop requires a platform that not only stores data as a system of record and keeps applications running 24x7, but also integrates easily with the rest of the enterprise data architecture and tools. MapR is the only distribution to provide these capabilities out of the box, and its architecture is the key to how it achieves this.
Building a highly reliable, high-performance cluster out of commodity servers takes massive engineering effort. Data as well as the services running on the cluster must be made highly available and fully protected from node failures without affecting the overall performance of the cluster. Achieving such speed and reliability on commodity servers is even more challenging because of the lack of non-volatile RAM (NVRAM) and the absence of any specialized connectivity between nodes with which to deploy redundant data paths or RAID configurations.
To achieve reliability with commodity hardware, one has to resort to a replication scheme in which multiple redundant copies of the data are stored. In normal operation, the cluster needs to write and replicate incoming data as quickly as possible, provide consistent reads, and ensure that metadata and service information are also redundantly maintained. When a crash occurs, the customer is exposed to data loss until the lost replica is recovered and resynchronized; the mean time to data loss (MTTDL) measures how well the system limits this exposure. In essence, the goal is to maximize MTTDL by resynchronizing the lost replica quickly, without impacting overall cluster performance and reliability.
HDFS takes a different approach: it bypasses the core problem entirely by making everything read-only, so re-syncing is trivial. There is a single writer per file, no reads are allowed while the file is being written, and the file close is the transaction that allows readers to see the data. When a failure occurs, unclosed files are deleted. The model assumes that if nobody saw the mutated data, it is okay to delete it, as if the data never existed at all.
This approach satisfied HDFS's original use case of web crawling: if a downloaded web page were lost, it would simply be downloaded again on the next crawl. But would a CIO or CFO in an enterprise dare take a chance with this model for critical business information? Can you really claim that if no one read the data, it effectively never existed? Clearly not.
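To make the write-then-close model concrete, here is a minimal sketch using the standard Hadoop FileSystem API (the path is made up, and the exact visibility guarantees depend on the HDFS version and whether hflush/hsync are used): data written to an open file is not guaranteed to be visible to readers until the file is closed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCloseVisibility {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/visibility-demo.txt");   // illustrative path

    FSDataOutputStream out = fs.create(file, true);
    out.writeBytes("mutated data");
    // The bytes are still buffered in the write pipeline; a concurrent reader
    // is not guaranteed to see them, and the reported length may still be 0.
    System.out.println("length before close: " + fs.getFileStatus(file).getLen());

    out.close();   // close is the "commit" that makes the data visible
    System.out.println("length after close:  " + fs.getFileStatus(file).getLen());
  }
}
```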
The limited semantics of HDFS also surfaces a couple of problems:
Real-time is not possible with HDFS. To make data visible in HDFS, you have to close the file immediately after writing it, so you are forced to write a small amount, close the file, and repeat the process again and again (see the sketch after the next paragraph). The problem is that you end up creating far too many files, which is a serious, well-documented problem for HDFS because of its centralized metadata storage architecture.
Furthermore, HDFS cannot truly support read-write access via NFS, because the NFS protocol has no "close" operation that it can invoke while writing into HDFS. This limitation means that HDFS has to guess when to close the file, and if it guesses wrong, you lose data.
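The sketch below, using the same Hadoop API, shows the pattern this limitation forces on a writer (file names, sizes, and counts are illustrative): to make each small batch visible immediately you must close the file and start a new one, which is exactly how you end up with a flood of tiny files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TinyFileFirehose {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // To make each small batch of events visible right away, the writer is
    // forced to close the current file and start a new one, over and over.
    for (int batch = 0; batch < 100_000; batch++) {
      Path p = new Path("/events/batch-" + batch + ".log");   // illustrative path
      FSDataOutputStream out = fs.create(p, true);
      out.writeBytes("a few records for batch " + batch + "\n");
      out.close();   // only now can readers see this batch
    }
    // Result: 100,000 tiny files, each of which consumes NameNode memory and
    // metadata operations -- the classic HDFS small-files problem.
  }
}
```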
So it is evident that we need a different approach to solve this problem while maintaining core file-system capabilities.
First, let's look at how long it actually takes to re-sync the data. As an example, say a server holds 24 TB, i.e., 12 drives of 2 TB each. At 1 GB per second, it would take about 7 hours to re-sync. In practice you'll get a throughput closer to 200 MB per second, so it will actually take about 35 hours. And if you want to do this online, you have to throttle the re-sync rate to about 1/10th of that, which means it will take 350 hours (roughly 15 days) to re-sync.
So what is your mean time to data loss (MTTDL)? Consider a drive whose mean time to failure is three years. Now take a cluster of 100 nodes with 10 drives each, i.e., 1,000 drives in the cluster. With a three-year MTTF per drive, you can expect roughly one drive failure in the cluster every day. So how long before you hit a double or triple disk failure? Can you really wait 15 days to re-sync while 15 other drives have already failed? At such speeds you will most likely lose data.
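As a quick sanity check, the numbers from the last two paragraphs can be reproduced with a few lines of arithmetic; the throughput and failure-rate figures are just the assumptions stated above.

```java
public class ResyncMath {
  public static void main(String[] args) {
    double serverTB = 24.0;           // 12 drives x 2 TB
    double fullSpeedGBps = 1.0;       // theoretical re-sync throughput
    double practicalMBps = 200.0;     // realistic sustained throughput
    double onlineFraction = 0.10;     // throttled to 1/10th for online re-sync

    double hoursAtFullSpeed = serverTB * 1024 / fullSpeedGBps / 3600;        // ~7 hours
    double hoursPractical   = serverTB * 1024 * 1024 / practicalMBps / 3600; // ~35 hours
    double hoursThrottled   = hoursPractical / onlineFraction;               // ~350 hours

    System.out.printf("full speed: %.0f h, practical: %.0f h, throttled: %.0f h (%.0f days)%n",
        hoursAtFullSpeed, hoursPractical, hoursThrottled, hoursThrottled / 24);

    // Failure rate: 1,000 drives, each with a 3-year mean time to failure,
    // gives roughly one failed drive per day somewhere in the cluster.
    double drives = 1000, mttfDays = 3 * 365;
    System.out.printf("expected drive failures per day: %.2f%n", drives / mttfDays);
  }
}
```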
The traditional way to side-step this fast re-sync problem is to use dual-ported disk arrays running RAID-6 with idle spares. The dual-ported array connects two servers, one to each port. The servers use NVRAM to manage the disks, and the primary continuously copies its NVRAM over to the replica. When the primary or the replica fails, the other takes over and there is nothing to re-sync, because it already has everything it needs for the drives to keep working. This can work, but it does not scale, because you now have to enter into large purchase contracts with a multi-year spare-parts plan.
Moreover, with this solution performance suffers, because you are back to the traditional architecture where a SAN or NAS is connected to a database with a bunch of applications hanging off it, and data is constantly moved to wherever processing needs to happen. Such a model does not deliver high performance at scale. What you really need is a platform that distributes both storage and processing, such as Hadoop, where the computation runs locally on the data and the system scales linearly to extreme limits, even across geographically dispersed locations.
MapR implements normal files, where a write is visible as soon as it is written. But to solve the resynchronization problem with commodity hardware at high speeds and reliability, MapR created a new architecture.
MapR chops the data on each node into thousands of pieces called containers, and these containers are replicated across the cluster. As an example, take a 100-node cluster and divide each node's data into 1,000 pieces spread out evenly, so that each node ends up holding roughly 1/100th of every other node's data.
When a server dies, the entire cluster re-syncs the dead node's data. How fast is that kind of re-sync? You have 99 nodes re-syncing in parallel, with 99x the drives, 99x the Ethernet ports, 99x the CPUs, and 99x the RAM, and each node is re-syncing only 1/100th of the data. So the net speed-up is about 100x, meaning the MTTDL is about 100x better. As the cluster grows, the MTTDL only improves: for a 1,000-node cluster it is about 1,000x better.
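Here is the same back-of-the-envelope arithmetic applied to a whole-node failure under the container scheme; the cluster size and per-node data volume are carried over from the earlier example.

```java
public class ParallelResync {
  public static void main(String[] args) {
    double nodeDataTB = 24.0;        // data held by the failed node
    double practicalMBps = 200.0;    // per-node re-sync throughput, as before
    int clusterNodes = 100;

    // Serial case: one surviving replica re-syncs all 24 TB by itself.
    double serialHours = nodeDataTB * 1024 * 1024 / practicalMBps / 3600;

    // Container case: the other ~99 nodes each re-sync roughly 1/100th of the
    // data in parallel, each using its own disks, NICs and CPUs.
    double parallelHours = (nodeDataTB / clusterNodes) * 1024 * 1024 / practicalMBps / 3600;

    System.out.printf("serial re-sync:   %.1f hours%n", serialHours);    // ~35 hours
    System.out.printf("parallel re-sync: %.1f hours%n", parallelHours);  // ~0.35 hours (~21 minutes)
    System.out.printf("speed-up:         ~%.0fx%n", serialHours / parallelHours);
  }
}
```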
MapR re-syncs its containers at high speed even under extreme conditions. For instance, consider the three container replicas in the figure, which are progressing in sync. If one of them fails, that is not a hard problem; the others will bring it back up to speed. But suppose that after running for a while, all three of them crash. Since many different operations are running concurrently in the cluster, it is very likely the three replicas will have diverged in different directions. The problem now is: once they recover, which one should be the master?
The issue at hand is that we don't know which replica(s) will survive the crash, or in which order they will come back. Therefore the software must be able to make any one of the replicas the new master, chosen essentially at random. MapR's innovations and patents revolve around solving this very difficult computer-science problem.
Essentially, every write that was acknowledged must survive the crash. MapR can detect the exact point at which replicas diverge, even at a 2 GB per second update rate. It can pick any one of the three copies as the new master, roll the other surviving replicas back to the divergence point, and then roll them forward to converge with the chosen master. MapR does this on the fly, with very little impact on normal operations, and it is very difficult to accomplish.
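To give a feel for the general idea (this is an illustrative simplification, not MapR's actual algorithm or data structures): if each replica kept an ordered log of the writes it acknowledged, the divergence point would be the last position on which the survivors agree, and the non-master replicas would be rolled back to that point and then rolled forward from the chosen master's log.

```java
import java.util.List;

public class ReplicaReconcile {

  /** Index of the first log position at which the two replicas disagree (illustrative only). */
  static int divergencePoint(List<Long> a, List<Long> b) {
    int i = 0;
    while (i < a.size() && i < b.size() && a.get(i).equals(b.get(i))) {
      i++;
    }
    return i;   // entries before i are common; entries from i onward diverge
  }

  public static void main(String[] args) {
    // Operation IDs acknowledged by each replica before the crash (made-up data).
    List<Long> master  = List.of(1L, 2L, 3L, 4L, 5L, 6L);
    List<Long> replica = List.of(1L, 2L, 3L, 4L, 7L);

    int dp = divergencePoint(master, replica);
    System.out.println("divergence point: " + dp);                          // 4
    System.out.println("roll replica back to: " + replica.subList(0, dp));  // [1, 2, 3, 4]
    System.out.println("then roll forward with: " + master.subList(dp, master.size())); // [5, 6]
  }
}
```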
MapR leverages these architectural innovations to provide a unique set of features for the Hadoop user.
The MapR container architecture outperforms the NAS, SAN, and HDFS approaches to storing metadata. MapR takes the file metadata that HDFS keeps inside the NameNode (the grey areas on the left side of the figure below) and lays it out across the regular cluster. Everything is triplicated and ultra-reliable with the MapR Distribution's no-NameNode design. With no NameNode, there is no practical limit to the number of files that can be stored on MapR, letting you go all the way to exabyte scale. An additional major benefit is much lower cost, because the cluster needs less hardware than HDFS, where multiple NameNodes are required to work around the file-count limit at scale, plus active standby servers to implement NameNode HA.
All of the Hadoop components also enjoy high availability. For example, with Apache YARN, the ApplicationMaster (the successor to the old JobTracker) and the TaskTrackers record their state in MapR. On a node failure, the ApplicationMaster recovers its state from MapR, so MapR uniquely allows all jobs to resume from exactly where they were before the crash. This works even when the entire cluster is restarted: if a job was 80% of the way through when the cluster crashed, it continues from 80% once the cluster comes back.
It is also quite simple to exploit the MapR Distribution's unique HA advantages for your own code. Save your service state and data in MapR, use ZooKeeper to notice service failures, and restart the service anywhere; the data and state will be there automatically. It is only with MapR that you get HA for Impala, Hive, Oozie, Storm, MySQL, Solr/Lucene, Kafka, and so on. The point is that this is not a one-off solution for each project in the Hadoop ecosystem: it is the same infrastructure underneath, and every part of Hadoop can exploit it, just as your own code can.
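As a hedged sketch of that pattern (the connect string, znode, and state path below are all made up), a service can advertise itself with an ephemeral ZooKeeper znode while keeping its state on the shared file system; a supervisor watches the znode and restarts the service on another node when it disappears.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceWatcher {
  static final String ZNODE = "/services/my-service";                      // illustrative znode
  static final String STATE = "/mapr/my.cluster/state/my-service";         // illustrative state dir

  // Run by the live service instance: advertise liveness with an ephemeral znode.
  static void register(ZooKeeper zk) throws Exception {
    zk.create(ZNODE, "host-a".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // The service's working state is written under STATE on the shared
    // file system, so any node in the cluster can pick it up later.
  }

  // Run by a supervisor: watch the znode and restart the service when it disappears.
  static void watchAndRestart(ZooKeeper zk) throws Exception {
    zk.exists(ZNODE, (WatchedEvent event) -> {
      if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
        // The instance's session expired, i.e. the node or process died.
        // Start a new instance on any healthy node; it re-reads its state from STATE.
        System.out.println("Service down; restarting elsewhere using state in " + STATE);
      }
    });
  }
}
```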
The figure below shows a NameNode benchmark that creates a file, writes 100 bytes, and closes it, which clearly illustrates the problem HDFS has in trying to be real-time. In the test we write a bunch of files and close them quickly, on a 10-node cluster. The blue line shows MapR performance and the red line at the bottom is an HDFS-based distribution; the x-axis is the number of files created and the y-axis shows file creates per second. MapR has a 40x advantage over the other distributions. Even if HDFS doubled, tripled, or quadrupled its performance, MapR would still be an order of magnitude faster.
In terms of speed, MapR holds MapReduce world records. The TeraSort benchmark measures how fast you can sort 1 TB of data; MapR holds the latest record at 45 seconds. The MinuteSort benchmark measures how much data you can sort in one minute, and MapR holds that record as well.
The MapR architecture preserves full read-write file-system semantics and exploits them to provide a standard NFS interface for cluster access, so you can deposit data directly into the cluster without any special connectors. Whether it is a web server, a database server, or an application server, it can dump data into MapR directly at very high speed. MapR can behave like a giant, fully HA NFS server. This saves a lot of time and energy otherwise spent fork-lifting data into Hadoop; with MapR you write it in directly, with no intermediate steps.
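Because the cluster is exposed over NFS, any application can write to it with ordinary file I/O. In the sketch below the mount point and path are assumptions (MapR clusters are conventionally mounted under /mapr/<cluster-name>, but your mount may differ); note there is no Hadoop client in sight.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsDirectWrite {
  public static void main(String[] args) throws IOException {
    // The cluster looks like an ordinary POSIX file system to this program:
    // no Hadoop client, no connectors, just a path under the NFS mount.
    Path logFile = Paths.get("/mapr/my.cluster.com/apps/web/access.log"); // illustrative mount + path

    Files.createDirectories(logFile.getParent());
    Files.write(logFile,
        "GET /index.html 200\n".getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    // The appended bytes are immediately visible to MapReduce jobs, Hive queries,
    // or anything else reading the same path through the cluster.
  }
}
```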
MapR introduced database tables in the MapR M7 Enterprise Database Edition. They are binary-compatible with Apache HBase and are accessed simply by using a different path name. What is unique about M7 is that the tables are integrated into the storage system: they are built on the container architecture and are available on every node, providing 100% data locality. There is no extra administration for database tables, and you can again have an essentially unlimited number of them, all the way up to a trillion tables. More importantly, because MapR can update data in place, there are no compactions to deal with. This ability yields 5x better performance with excellent 95th- and 99th-percentile latencies.
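Access goes through the standard HBase client API; the only visible difference is that the table is named by a file-system path, which requires MapR's client libraries (the path and column family below are made up).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PathNamedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         // The table is addressed by a path in the file-system namespace (illustrative path).
         Table table = conn.getTable(TableName.valueOf("/user/analytics/clicks"))) {

      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
      table.put(put);

      Result row = table.get(new Get(Bytes.toBytes("user42")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("page"))));
    }
  }
}
```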
MapR is the technology leader for Hadoop because of a robust foundational architecture that cannot be replicated through bolt-on software patches. Case in point: over the years, other distributions (all of which rely on HDFS) have tried to catch up with the MapR platform's core functionality by offering sub-optimal point solutions: one HA solution for the NameNode, a different one for the JobTracker, and a completely different HA architecture for HBase. Moreover, these solutions are incomplete at best, as is the case with other attempts at basic enterprise-grade functionality such as snapshots, which cannot provide point-in-time recovery, and NFS access, which is read-only and slow.
Architecture matters for a big data platform that serves a wide array of users who need it to run all the time, deliver the best performance, and be easy to work with. Only the MapR Converged Data Platform gives you the architectural foundation to use low-cost commodity hardware while getting enterprise-class reliability, with instant restart, snapshots, mirrors, and no single point of failure, as well as access through standard protocols like NFS, ODBC, and Hadoop.