Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan and Aishwarya Ganesan, University of Wisconsin–Madison;
Eric Lee, University of Texas at Austin; Aws Albarghouthi, University of Wisconsin–Madison;
Vijay Chidambaram, University of Texas at Austin; Andrea C. Arpaci-Dusseau and
Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison

https://www.usenix.org/conference/fast18/presentation/alagappan

This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies.
February 12–15, 2018 • Oakland, CA, USA
ISBN 978-1-931971-42-3

Open access to the Proceedings of the 16th USENIX Conference on File and Storage Technologies is sponsored by USENIX.
Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee†, Aws Albarghouthi,
Vijay Chidambaram†, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
University of Wisconsin–Madison    †University of Texas at Austin
Abstract

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

1 Introduction

Failure recovery using redundancy is central to improved reliability of distributed systems [14,22,31,35,61,67]. Distributed systems recover from node crashes and network failures using copies of data and functionality on several nodes [6,47,55]. Similarly, bad or corrupted data on one node should be recovered from redundant copies.

In a static setting where all nodes always remain reachable and where clients do not actively update data, recovering corrupted data from replicas is straightforward; in such a setting, a node could repair its state by simply fetching the data from any other node.

In reality, however, a distributed system is a dynamic environment, constantly in a state of flux. In such settings, orchestrating recovery correctly is surprisingly hard. As a simple example, consider a quorum-based system in which a piece of data is corrupted on one node. When the node tries to recover its data, some nodes may fail and be unreachable, and some nodes may have recently recovered from a failure and so lack the required data or hold a stale version. If enough care is not exercised, the node could "fix" its data from a stale node, overwriting the new data and potentially leading to a data loss.

To correctly recover corrupted data from redundant copies in a distributed system, we propose that a recovery approach should be protocol-aware. A protocol-aware recovery (PAR) approach is carefully designed based on how the distributed system performs updates to its replicated data, elects the leader, etc. For instance, in the previous example, a PAR mechanism would realize that a faulty node has to query at least R (read quorum) other nodes to safely and quickly recover its data.
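As a rough illustration of this read-quorum requirement (and not the mechanism CTRL itself uses), the following Go sketch recovers a single log entry only after hearing from at least R peers; the Peer interface, the Entry type, and the highest-term tiebreak are assumptions made for this example.

```go
// Illustrative only: a faulty node recovers one log entry by consulting a
// read quorum of peers instead of trusting any single copy. Peer and Entry
// are hypothetical stand-ins for a real RSM implementation.
package recovery

import "fmt"

type Entry struct {
	Term  uint64 // consensus term in which the entry was written
	Data  []byte
	Found bool // whether the peer holds an entry at this index
}

type Peer interface {
	ReadEntry(index uint64) (Entry, error)
}

// RecoverEntry queries all reachable peers for the entry at index and accepts
// the highest-term copy, but only if at least readQuorum peers responded.
func RecoverEntry(peers []Peer, index uint64, readQuorum int) (Entry, error) {
	var best Entry
	responses := 0
	for _, p := range peers {
		e, err := p.ReadEntry(index)
		if err != nil {
			continue // unreachable or faulty peer; try the others
		}
		responses++
		if e.Found && (!best.Found || e.Term > best.Term) {
			best = e
		}
	}
	if responses < readQuorum {
		return Entry{}, fmt.Errorf("only %d peers responded, need %d", responses, readQuorum)
	}
	return best, nil
}
```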
In this paper, we apply PAR to replicated state machine (RSM) systems. We focus on RSM systems for two reasons. First, correctly implementing recovery is most challenging for RSM systems because of the strong consistency and durability guarantees they provide [58]; a small misstep in recovery could violate the guarantees. Second, the reliability of RSM systems is crucial: many systems entrust RSM systems with their critical data [45]. For example, Bigtable, GFS, and other systems [7,26] store their metadata on RSM systems such as Chubby [16] or ZooKeeper [4]. Hence, protecting RSM systems from storage faults such as data corruption will improve the reliability of many dependent systems.

We first characterize the different approaches to handling storage faults by developing the RSM recovery taxonomy, through experimental and qualitative analysis of practical systems and methods proposed by prior research (§2). Our analyses show that most approaches employed by currently deployed systems do not use any protocol-level knowledge to perform recovery, leading to disastrous outcomes such as data loss and unavailability.

Thus, to improve the resiliency of RSM systems to storage faults, we design a new protocol-aware recovery approach that we call corruption-tolerant replication, or CTRL (§3). CTRL constitutes two components: a local storage layer and a distributed recovery protocol; while the storage layer reliably detects faults, the distributed protocol recovers faulty data from redundant copies. Both components carefully exploit RSM-specific knowledge to ensure safety (e.g., no data loss) and high availability.

CTRL applies several novel techniques to achieve safety and high availability. For example, a crash-corruption disentanglement technique in the storage layer distinguishes corruptions caused by crashes from disk faults; without this technique, safety violations or unavailability could result. Next, a global-commitment determination protocol in the distributed recovery separates committed items from uncommitted ones; this separation is critical: while recovering faulty committed items is necessary for safety, discarding uncommitted items quickly is crucial for availability. Finally, a novel leader-initiated snapshotting mechanism enables identical snapshots across nodes to greatly simplify recovery.
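The details of these techniques appear later (§3) and are not part of this excerpt; as a rough, assumed illustration of the kind of decision crash-corruption disentanglement must make, a storage layer might treat a checksum failure confined to the tail of the log differently from one in its interior:

```go
// Rough illustration (not CTRL's actual algorithm): a checksum failure in
// the last record of the log may be a partially written entry from a crash
// and can be safely discarded, whereas a failure in the interior indicates
// corruption of data that may already be committed and must be recovered
// from other replicas.
package disentangle

type Verdict int

const (
	CleanLog      Verdict = iota // no faulty records
	TornTailWrite                // fault only in the final record: crash suspected
	Corruption                   // fault in the interior: recover from redundancy
)

// Classify inspects per-record checksum results (true = record verified).
func Classify(recordOK []bool) Verdict {
	for i, ok := range recordOK {
		if !ok {
			if i == len(recordOK)-1 {
				return TornTailWrite
			}
			return Corruption
		}
	}
	return CleanLog
}
```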
We implement CTRL in two storage systems that are based on different consensus algorithms: LogCabin [43] (based on Raft [50]) and ZooKeeper [4] (based on ZAB [39]) (§4). Through experiments, we show that CTRL versions provide safety and high availability in the presence of storage faults, while the original systems remain unsafe or unavailable in many cases; we also show that CTRL induces minimal performance overhead (§5).

2 Background and Motivation

We first provide background on storage faults and RSM systems. We then present the taxonomy of different approaches to handling storage faults in RSM systems.

2.1 Storage Faults in Distributed Systems

Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8,9,32,59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60,62]. Thus, it is critical to reliably detect and recover from storage faults.

Storage faults occur due to several reasons: media errors [10], program/read disturbance [60], and bugs in firmware [9], device drivers [66], and file systems [27,28]. Storage faults manifest in two ways: block errors and corruption. Block errors (or latent sector errors) arise when the device internally detects a problem with a block and throws an error upon access. Studies of both flash [33,60] and hard drives [10,59] show that block errors are common. Corruption could occur due to lost and misdirected writes that may not be detected by the device. Studies [9,51] and anecdotal evidence [36,37,57] show the prevalence of data corruption in the real world.

Many local file systems, on encountering a storage fault, simply propagate the fault to applications [11,54,64]. For example, ext4 silently returns corrupted data if the underlying device block is corrupted. In contrast, a few file systems transform an underlying fault into a different one; for example, btrfs returns an error to applications if the accessed block is corrupted on the device. In either case, storage systems built atop local file systems should handle corrupted data and storage errors to preserve end-to-end data integrity.
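To illustrate the kind of application-level integrity checking this implies (the record format and names below are assumptions for this example, not LogCabin's or ZooKeeper's actual on-disk layout), a storage layer can checksum every record it writes and refuse to return a record whose checksum does not verify:

```go
// Minimal sketch of end-to-end integrity checking above the file system.
// Each record is stored as [4-byte CRC32][4-byte length][payload]; because
// a file system such as ext4 may silently return corrupted blocks, the
// checksum is the storage system's only way to tell good data from bad.
package wal

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
	"io"
)

// ErrCorrupt distinguishes a checksum mismatch from an ordinary I/O error.
var ErrCorrupt = errors.New("record failed checksum verification")

// WriteRecord appends one checksummed record to w.
func WriteRecord(w io.Writer, payload []byte) error {
	hdr := make([]byte, 8)
	binary.LittleEndian.PutUint32(hdr[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	if _, err := w.Write(hdr); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// ReadRecord reads one record and returns ErrCorrupt if the stored checksum
// does not match the payload, instead of passing bad data to the caller.
// A real implementation would also bound and validate the length field.
func ReadRecord(r io.Reader) ([]byte, error) {
	hdr := make([]byte, 8)
	if _, err := io.ReadFull(r, hdr); err != nil {
		return nil, err // I/O error (or EOF) from the device or file system
	}
	sum := binary.LittleEndian.Uint32(hdr[0:4])
	payload := make([]byte, binary.LittleEndian.Uint32(hdr[4:8]))
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	if crc32.ChecksumIEEE(payload) != sum {
		return nil, ErrCorrupt
	}
	return payload, nil
}
```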
One way to tackle storage faults is to use RAID-like storage to maintain multiple copies of data on each node. However, many distributed deployments would like to use inexpensive disks [22,31]. Given that the data in a distributed system is inherently replicated, it is wasteful to store multiple copies on each node. Hence, it is important for distributed systems to use the inherent redundancy to recover from storage faults.

2.2 RSM-based Storage Systems

Our goal is to harden RSM systems to storage faults. In an RSM system, a set of nodes compute identical states by executing commands on a state machine (an in-memory data structure on each node) [58]. Typically, clients interact with a single node (the leader) to execute operations on the state machine. Upon receiving a command, the leader durably writes the command to an on-disk log and replicates it to the followers. When a majority of nodes have durably persisted the command in their logs, the leader applies the command to its state machine and returns the result to the client; at this point, the command is committed. The commands in the log have to be applied to the state machine in-order. Losing or overwriting committed commands violates the safety property of the state machine. The replicated log is kept consistent across nodes by a consensus protocol such as Paxos [41] or Raft [50].
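The commit path just described can be summarized in a few lines of Go; the sketch below uses hypothetical Log, Follower, and StateMachine interfaces and omits terms, elections, retries, and asynchrony that any real consensus implementation needs:

```go
// Simplified sketch of the RSM commit path: the leader durably appends the
// command, replicates it, and applies it only once a majority has persisted
// it. All interfaces here are hypothetical stand-ins.
package rsm

import "errors"

type Command []byte

type Log interface {
	Append(cmd Command) error // durably write the command to the on-disk log
}

type Follower interface {
	Replicate(index uint64, cmd Command) error // nil once persisted remotely
}

type StateMachine interface {
	Apply(cmd Command) ([]byte, error) // must be invoked in log order
}

var ErrNotCommitted = errors.New("command not persisted on a majority of nodes")

// Submit runs one client command through the leader.
func Submit(log Log, followers []Follower, sm StateMachine,
	index uint64, cmd Command) ([]byte, error) {
	if err := log.Append(cmd); err != nil {
		return nil, err
	}
	acks := 1 // the leader's own durable copy counts toward the majority
	for _, f := range followers {
		if f.Replicate(index, cmd) == nil {
			acks++
		}
	}
	clusterSize := len(followers) + 1
	if acks <= clusterSize/2 {
		return nil, ErrNotCommitted
	}
	// The command is now committed: apply it and return the result.
	return sm.Apply(cmd)
}
```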
Because the log can grow indefinitely and exhaust disk space, a snapshot of the in-memory state machine is periodically written to disk and the log is garbage collected. When a node restarts after a crash, it restores the system state by reading the latest on-disk snapshot and the log. The node also recovers its critical metadata (e.g., log start index) from a structure called metainfo. Thus, each node maintains three critical persistent data structures: the log, the snapshots, and the metainfo.
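A restart therefore proceeds from these three structures; the Go sketch below shows the general shape of that recovery with hypothetical MetaInfo, Snapshot, and StateMachine types, not the recovery code of any particular system:

```go
// Sketch of crash-restart recovery from the three persistent structures:
// metainfo, the latest snapshot, and the log. Types are illustrative.
package restart

import "fmt"

type MetaInfo struct {
	LogStartIndex uint64 // first index still present in the on-disk log
}

type Snapshot struct {
	LastIncludedIndex uint64 // state already reflects entries up to this index
	State             []byte
}

type LogEntry struct {
	Index uint64
	Data  []byte
}

type StateMachine interface {
	Restore(state []byte) error
	Apply(data []byte) error
}

// Restart rebuilds the in-memory state machine after a crash.
func Restart(meta MetaInfo, snap Snapshot, log []LogEntry, sm StateMachine) error {
	// The garbage-collected log must begin at or before the entry that
	// immediately follows the snapshot, or entries would be missing.
	if meta.LogStartIndex > snap.LastIncludedIndex+1 {
		return fmt.Errorf("gap between snapshot (%d) and log start (%d)",
			snap.LastIncludedIndex, meta.LogStartIndex)
	}
	if err := sm.Restore(snap.State); err != nil {
		return err
	}
	for _, e := range log {
		if e.Index <= snap.LastIncludedIndex {
			continue // already covered by the snapshot
		}
		if err := sm.Apply(e.Data); err != nil {
			return err
		}
	}
	return nil
}
```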
These persistent data structures could be corrupted due to storage faults. Practical systems try to safely recover the data and remain available under such failures [15,17]. However, as we will show, none of the current approaches correctly recover from storage faults, motivating the need for a new approach.

2.3 RSM Recovery Taxonomy

To understand the different possible ways of handling storage faults in RSM systems, we analyze a broad range of approaches. We perform this analysis by two means: first, we analyze practical systems including ZooKeeper, LogCabin, etcd [25], and a Paxos-based system [24] using a fault-injection framework we developed (§5); second, we analyze techniques proposed by prior research or used in proprietary systems [15,17].
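As an illustration of the kind of fault such a framework injects (this is not the framework referred to above), a corruption can be emulated by flipping bits in a chosen range of a node's on-disk log or snapshot file:

```go
// Illustrative corruption injector: XOR-flips the bytes in a chosen range of
// a file so that a checksum-protected reader observes a corrupted record on
// its next access. Not the fault-injection framework described in the paper.
package faultinject

import "os"

// CorruptRange flips every bit in the bytes at [offset, offset+length).
func CorruptRange(path string, offset, length int64) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, length)
	if _, err := f.ReadAt(buf, offset); err != nil {
		return err
	}
	for i := range buf {
		buf[i] ^= 0xFF
	}
	_, err = f.WriteAt(buf, offset)
	return err
}
```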
Through our analysis, we classify the approaches into two categories: protocol-oblivious and protocol-aware. The oblivious approaches do not use any protocol-level knowledge to perform recovery. Upon detecting a fault, these approaches take a recovery action locally on the faulty node; such actions interact with the distributed protocols in unsafe ways, leading to data loss. The protocol-aware approaches use some RSM-specific knowledge to recover; however, they do not use this knowledge correctly, leading to undesirable outcomes. Our taxonomy is not complete in that there may be other techniques; however, to the best of our knowledge, we have not observed other approaches apart from those in our taxonomy.
Figure 1: Sample Scenarios. The figure shows sample scenarios, (i) through (vi), of log contents on five servers (S1–S5) in which current approaches fail. Faulty entries are striped. Crashed and lagging nodes are shown as gray and empty boxes, respectively.

Table 1: Recovery Taxonomy. The table shows how different approaches behave in Figure 1's scenarios, comparing the protocol-oblivious approaches (NoDetection, Crash, Truncate, DeleteRebuild) and the protocol-aware ones (MarkNonVoting, Reconfigure, Byzantine FT, CTRL) on safety, availability, performance, intervention, extra nodes, recovery time, and complexity. E = Return Corrupted, L = Data Loss, U = Unavailable, C = Correct. While all current approaches are unsafe or unavailable, CTRL ensures safety and high availability.
To illustrate the problems, we use Figure 1. In all cases, log entries 1, 2, and 3 are committed†; losing these items will violate safety. Table 1 shows how each approach behaves in Figure 1's scenarios. As shown in the table, all current approaches lead to safety violation (e.g., data loss), low availability, or both. A recovery mechanism that effectively uses redundancy should be safe and available in all cases. Table 1 also compares the approaches along other axes such as performance, maintenance overhead (intervention and extra nodes), recovery time, and complexity. Although Figure 1 shows only faults in the log, the taxonomy applies to other structures including the snapshots and the metainfo.

†A log entry contains a state-machine command and data.

NoDetection. The simplest reaction to storage faults is none at all: to trust every layer in the storage stack to work reliably. For example, a few prototype Paxos-based systems [24] do not use checksums for their on-disk data; similarly, LogCabin does not protect its snapshots with checksums. NoDetection trivially violates safety; corrupted data can be obliviously served to clients. However, deployed systems do use checksums and other integrity strategies for most of their on-disk data.

Crash. A better strategy is to use checksums and handle I/O errors, and crash the node on detecting a fault. Crash may seem like a good strategy because it intends to prevent any damage that the faulty node may inflict on the system. Our experiments show that the Crash approach is common: LogCabin, ZooKeeper, and etcd crash sometimes when their logs are faulty. Also, ZooKeeper crashes when its snapshots are corrupted.

Although Crash preserves safety, it suffers from severe unavailability. Given that nodes could be unavailable due to other failures, even a single storage fault results in unavailability, as shown in Figure 1(i). Similarly, a single fault even in different portions of data on a majority (e.g., Figure 1(v)) renders the system unavailable. Note that simply restarting the node does not help; storage faults, unlike other faults, could be persistent: the node will encounter the same fault and crash again until manual intervention, which is error-prone and may cause a data loss. Thus, it is desirable to recover automatically.

Truncate. A more sophisticated action is to truncate (possibly faulty) portions of data and continue operating. The intuition behind Truncate is that if the faulty data is discarded, the node can continue to operate (unlike Crash), improving availability.
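In its simplest form, such a reaction keeps only the clean prefix of the log, as in the illustrative Go sketch below (the Entry type and its flags are hypothetical); note that committed entries after the faulty one are discarded along with it, which is exactly what the following scenario exploits.

```go
// Sketch of a protocol-oblivious Truncate reaction: keep the prefix of the
// log up to the first entry that fails verification and drop everything
// after it, even entries that the protocol has already committed.
package truncate

type Entry struct {
	Index     uint64
	Committed bool // committed according to the consensus protocol
	OK        bool // passed checksum verification on read
}

// TruncateAtFirstFault returns the longest clean prefix of the log.
func TruncateAtFirstFault(log []Entry) []Entry {
	for i, e := range log {
		if !e.OK {
			// Everything from i onward is dropped locally, including
			// committed entries, so a later majority formed with lagging
			// peers may have no memory of them.
			return log[:i]
		}
	}
	return log
}
```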
However, we find that Truncate can cause a safety violation (data loss). Consider the scenario shown in Figure 2 in which entry 1 is corrupted on S1; S4 and S5 are lagging and do not have any entry. Assume S2 is the leader. When S1 reads its log, it detects the corruption; however, S1 truncates its log, losing the corrupted entry and all subsequent entries (Figure 2(ii)). Meanwhile, S2 (leader) and S3 crash. S1, S4, and S5 form a majority and elect S1 the leader. Now the system does not have any knowledge of committed entries 1, 2, and 3, resulting in a silent data loss. The system also commits new entries x, y, and z in the place of 1, 2, and 3 (Figure 2(iii)). Finally, when S2 and S3 recover, they follow S1's log (Figure 2(iv)), completely removing entries 1, 2, and 3.

In summary, although the faulty node detects the corruption, it truncates its log, losing the data locally. When this node forms a majority along with other nodes that are lagging, data is silently lost, violating safety. We find this safety violation in ZooKeeper and LogCabin.

Further, Truncate suffers from inefficient recovery. For instance, in Figure 1(i), S1 truncates its log after a fault, losing entries 1, 2, and 3. Now to fix S1's log, the leader needs to transfer all entries, increasing S1's recovery time and wasting network bandwidth. ZooKeeper and LogCabin suffer from this slow recovery problem.

DeleteRebuild. Another commonly employed action is to manually delete all data on the faulty node and restart the node. Unfortunately, similar to Truncate, DeleteRebuild can violate safety; specifically, a node whose data is deleted could form a majority along with the lagging nodes, leading to a silent data loss. Surprisingly, administrators often use this approach hoping that the faulty