0%

tla+

分布式协议很难证明其正确性,常规想法是通过测试尽可能覆盖所有可能。但实现是,再多的测试也很难覆盖住所有的可能,如果在线上暴露问题,会造成严重的后果。形式化验证正是为了解决这样的问题,它使用计算机强大的计算能力,暴力的搜索所有可能的行为,检查是否满足事先设定的属性,任何不符合预期的行为都能被发现,从根本上保证算法的正确性。

一句话说,在理论上通过暴力搜索所有的可能保证协议理论的正确性,之后只需要考虑如何正确的实现协议。在出问题时不需要考虑协议是否有问题,而是代码实现的有问题。

TLA is a language for high-level modeling of digital systems. High-level means: at the design level; above the code level.

can help find and correct design errors: errors hard to find by testing; before writing any code

TLA简介

TLA+(Temporal Logic of Actions) 是Leslie Lamport开发的一门形式化验证语言,用于程序的设计、建模、文档和验证等,特别是并发系统和分布式系统。TLA+的设计初衷是用简单的数学理论和公式精准地对系统进行描述。

使用TLA+对程序进行形式化验证,首先要用TLA+对程序进行描述,这样的描述称为规范(Specification)。有了Specification以后就可以使用TLC模型检查器来运行它,运行的过程会遍历所有可能的行为,检查Specification中设定的属性,发现非预期的行为。

TLA+在学术界和工业界都有着广泛的应用。(TLA+ Examples)[https://lamport.azurewebsites.net/tla/tla.html]给出了一些使用TLA+验证过的分布式算法和并发算法。很多分布式算法论文在非形式化的论证介绍之外, 会附带TLA+的Specification来证明自己的算法是经过形式化验证的。

TLA videos

introduction

  • TLA is a language for high-level modeling of digital systems. High-level means: at the design level; above the code level.
  • can help find and correct design errors
    • errors hard to find by testing
    • before writing any code
  • abstraction helps us understand complex systems
  • we use TLA+ to ensure the systems we build “work right”. Working right means satifying certain properties.
  • an exeution of a system is represented as a sequence of discret steps.
    • strange to describe a concurrent system, as a sequence of steps.
    • TLA+ descirbes a step as a state change, an execution is represented as a sequence of states. A step is the change from one state to the next.
    • TLA+ descibes a state as an assignment of values to variables
  • [define] call a sequence of states a behavior
  • [define] a state is an assignment of values to variables, so a state machine is described by:
    • what the varables are.
    • possible initial values of variables.
    • a realation between their values in the current state and their possible values in the next state.

state machine in TLA+

1
2
3
4
5
6
7
8
9
10
11
12
13
------ module SimpleProgram -------
EXTENDS Integers
VARIABLES i, pc

Init == (pc == "start") /\ (i=0)

Next == \/ /\ pc = "start"
/\ i' \in 0..10000
/\ pc' = "middle"
\/ /\ pc = "middle"
/\ i' = i + 1
/\ pc' = "done"
===================================

resource and tools

  • ToolboxThe TLA Toolbox is an IDE (integrated development environment) for the TLA+ tools
  • example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
EXTENDS Integers
VARIABLES i, pc

Init == (pc = "start") /\ (i = 0)

Pick == /\ pc = "start"
/\ i' \in 0..1000
/\ pc' = "middle"

Add1 == /\ pc = "middle"
/\ i' = i + 1
/\ pc' = "done"

Next == Pick \/ Add1
  • TLC report deadlock means execution stopped when it wasn’t supposed to. termination means execution stopped when it was supposed to
  • TLAPS (TLA proof system) can check proofs of correctness of algorithm

transaction commit

  • All resources managers must agree on whether it committed or aborted.
  • in tla+, every value is a set
  • [var |-> expr] means lambda function, for example: sqr == [ i \in 1…42 |-> i^2] equal sqr[i] = i^2 for all i form 1 to 42.
  • terminology:
Programming Math
array function
index set domain
f[e] f(e)

note: f(e) in tla has another use.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
------------------------------ MODULE TCommit ------------------------------
(***************************************************************************)
(* This specification is explained in "Transaction Commit", Lecture 5 of *)
(* the TLA+ Video Course. *)
(***************************************************************************)
CONSTANT RM \* The set of participating resource managers

VARIABLE rmState \* rmState[rm] is the state of resource manager rm.
-----------------------------------------------------------------------------
TCTypeOK ==
(*************************************************************************)
(* The type-correctness invariant *)
(*************************************************************************)
rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]

TCInit == rmState = [r \in RM |-> "working"]
(*************************************************************************)
(* The initial predicate. *)
(*************************************************************************)

canCommit == \A r \in RM : rmState[r] \in {"prepared", "committed"}
(*************************************************************************)
(* True iff all RMs are in the "prepared" or "committed" state. *)
(*************************************************************************)

notCommitted == \A r \in RM : rmState[r] # "committed"
(*************************************************************************)
(* True iff no resource manager has decided to commit. *)
(*************************************************************************)
-----------------------------------------------------------------------------
(***************************************************************************)
(* We now define the actions that may be performed by the RMs, and then *)
(* define the complete next-state action of the specification to be the *)
(* disjunction of the possible RM actions. *)
(***************************************************************************)
Prepare(r) == /\ rmState[r] = "working"
/\ rmState' = [rmState EXCEPT ![r] = "prepared"]

Decide(r) == \/ /\ rmState[r] = "prepared"
/\ canCommit
/\ rmState' = [rmState EXCEPT ![r] = "committed"]
\/ /\ rmState[r] \in {"working", "prepared"}
/\ notCommitted
/\ rmState' = [rmState EXCEPT ![r] = "aborted"]

TCNext == \E r \in RM : Prepare(r) \/ Decide(r)
(*************************************************************************)
(* The next-state action. *)
(*************************************************************************)
-----------------------------------------------------------------------------
TCConsistent ==
(*************************************************************************)
(* A state predicate asserting that two RMs have not arrived at *)
(* conflicting decisions. It is an invariant of the specification. *)
(*************************************************************************)
\A r1, r2 \in RM : ~ /\ rmState[r1] = "aborted"
/\ rmState[r2] = "committed"
-----------------------------------------------------------------------------
(***************************************************************************)
(* The following part of the spec is not discussed in Video Lecture 5. It *)
(* will be explained in Video Lecture 8. *)
(***************************************************************************)
TCSpec == TCInit /\ [][TCNext]_rmState
(*************************************************************************)
(* The complete specification of the protocol written as a temporal *)
(* formula. *)
(*************************************************************************)

THEOREM TCSpec => [](TCTypeOK /\ TCConsistent)
(*************************************************************************)
(* This theorem asserts the truth of the temporal formula whose meaning *)
(* is that the state predicate TCTypeOK /\ TCInvariant is an invariant *)
(* of the specification TCSpec. Invariance of this conjunction is *)
(* equivalent to invariance of both of the formulas TCTypeOK and *)
(* TCConsistent. *)
(*************************************************************************)

=============================================================================

Two Phase Commit

  • TLC will check fewer states if the model sets a sysmmetry set to a set of model values
  • TLC will miss errors if you claim a set is a symmetry set when it’s not
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
------------------------------ MODULE TwoPhase ------------------------------

(***************************************************************************)
(* This specification is discussed in "Two-Phase Commit", Lecture 6 of the *)
(* TLA+ Video Course. It describes the Two-Phase Commit protocol, in *)
(* which a transaction manager (TM) coordinates the resource managers *)
(* (RMs) to implement the Transaction Commit specification of module *)
(* TCommit. In this specification, RMs spontaneously issue Prepared *)
(* messages. We ignore the Prepare messages that the TM can send to the *)
(* RMs. *)
(* *)
(* For simplicity, we also eliminate Abort messages sent by an RM when it *)
(* decides to abort. Such a message would cause the TM to abort the *)
(* transaction, an event represented here by the TM spontaneously deciding *)
(* to abort. *)
(***************************************************************************)
CONSTANT RM \* The set of resource managers

VARIABLES
rmState, \* rmState[r] is the state of resource manager r.
tmState, \* The state of the transaction manager.
tmPrepared, \* The set of RMs from which the TM has received "Prepared"
\* messages.
msgs
(***********************************************************************)
(* In the protocol, processes communicate with one another by sending *)
(* messages. For simplicity, we represent message passing with the *)
(* variable msgs whose value is the set of all messages that have been *)
(* sent. A message is sent by adding it to the set msgs. An action *)
(* that, in an implementation, would be enabled by the receipt of a *)
(* certain message is here enabled by the presence of that message in *)
(* msgs. For simplicity, messages are never removed from msgs. This *)
(* allows a single message to be received by multiple receivers. *)
(* Receipt of the same message twice is therefore allowed; but in this *)
(* particular protocol, that's not a problem. *)
(***********************************************************************)

Messages ==
(*************************************************************************)
(* The set of all possible messages. Messages of type "Prepared" are *)
(* sent from the RM indicated by the message's rm field to the TM. *)
(* Messages of type "Commit" and "Abort" are broadcast by the TM, to be *)
(* received by all RMs. The set msgs contains just a single copy of *)
(* such a message. *)
(*************************************************************************)
[type : {"Prepared"}, rm : RM] \cup [type : {"Commit", "Abort"}]

TPTypeOK ==
(*************************************************************************)
(* The type-correctness invariant *)
(*************************************************************************)
/\ rmState \in [RM -> {"working", "prepared", "committed", "aborted"}]
/\ tmState \in {"init", "done"}
/\ tmPrepared \subseteq RM
/\ msgs \subseteq Messages

TPInit ==
(*************************************************************************)
(* The initial predicate. *)
(*************************************************************************)
/\ rmState = [r \in RM |-> "working"]
/\ tmState = "init"
/\ tmPrepared = {}
/\ msgs = {}
-----------------------------------------------------------------------------
(***************************************************************************)
(* We now define the actions that may be performed by the processes, first *)
(* the TM's actions, then the RMs' actions. *)
(***************************************************************************)
TMRcvPrepared(r) ==
(*************************************************************************)
(* The TM receives a "Prepared" message from resource manager r. We *)
(* could add the additional enabling condition r \notin tmPrepared, *)
(* which disables the action if the TM has already received this *)
(* message. But there is no need, because in that case the action has *)
(* no effect; it leaves the state unchanged. *)
(*************************************************************************)
/\ tmState = "init"
/\ [type |-> "Prepared", rm |-> r] \in msgs
/\ tmPrepared' = tmPrepared \cup {r}
/\ UNCHANGED <<rmState, tmState, msgs>>

TMCommit ==
(*************************************************************************)
(* The TM commits the transaction; enabled iff the TM is in its initial *)
(* state and every RM has sent a "Prepared" message. *)
(*************************************************************************)
/\ tmState = "init"
/\ tmPrepared = RM
/\ tmState' = "done"
/\ msgs' = msgs \cup {[type |-> "Commit"]}
/\ UNCHANGED <<rmState, tmPrepared>>

TMAbort ==
(*************************************************************************)
(* The TM spontaneously aborts the transaction. *)
(*************************************************************************)
/\ tmState = "init"
/\ tmState' = "done"
/\ msgs' = msgs \cup {[type |-> "Abort"]}
/\ UNCHANGED <<rmState, tmPrepared>>

RMPrepare(r) ==
(*************************************************************************)
(* Resource manager r prepares. *)
(*************************************************************************)
/\ rmState[r] = "working"
/\ rmState' = [rmState EXCEPT ![r] = "prepared"]
/\ msgs' = msgs \cup {[type |-> "Prepared", rm |-> r]}
/\ UNCHANGED <<tmState, tmPrepared>>

RMChooseToAbort(r) ==
(*************************************************************************)
(* Resource manager r spontaneously decides to abort. As noted above, r *)
(* does not send any message in our simplified spec. *)
(*************************************************************************)
/\ rmState[r] = "working"
/\ rmState' = [rmState EXCEPT ![r] = "aborted"]
/\ UNCHANGED <<tmState, tmPrepared, msgs>>

RMRcvCommitMsg(r) ==
(*************************************************************************)
(* Resource manager r is told by the TM to commit. *)
(*************************************************************************)
/\ [type |-> "Commit"] \in msgs
/\ rmState' = [rmState EXCEPT ![r] = "committed"]
/\ UNCHANGED <<tmState, tmPrepared, msgs>>

RMRcvAbortMsg(r) ==
(*************************************************************************)
(* Resource manager r is told by the TM to abort. *)
(*************************************************************************)
/\ [type |-> "Abort"] \in msgs
/\ rmState' = [rmState EXCEPT ![r] = "aborted"]
/\ UNCHANGED <<tmState, tmPrepared, msgs>>

TPNext ==
\/ TMCommit \/ TMAbort
\/ \E r \in RM :
TMRcvPrepared(r) \/ RMPrepare(r) \/ RMChooseToAbort(r)
\/ RMRcvCommitMsg(r) \/ RMRcvAbortMsg(r)
-----------------------------------------------------------------------------
(***************************************************************************)
(* The material below this point is not discussed in Video Lecture 6. It *)
(* will be explained in Video Lecture 8. *)
(***************************************************************************)

TPSpec == TPInit /\ [][TPNext]_<<rmState, tmState, tmPrepared, msgs>>
(*************************************************************************)
(* The complete spec of the Two-Phase Commit protocol. *)
(*************************************************************************)

THEOREM TPSpec => []TPTypeOK
(*************************************************************************)
(* This theorem asserts that the type-correctness predicate TPTypeOK is *)
(* an invariant of the specification. *)
(*************************************************************************)
-----------------------------------------------------------------------------
(***************************************************************************)
(* We now assert that the Two-Phase Commit protocol implements the *)
(* Transaction Commit protocol of module TCommit. The following statement *)
(* imports all the definitions from module TCommit into the current *)
(* module. *)
(***************************************************************************)
INSTANCE TCommit

THEOREM TPSpec => TCSpec
(*************************************************************************)
(* This theorem asserts that the specification TPSpec of the Two-Phase *)
(* Commit protocol implements the specification TCSpec of the *)
(* Transaction Commit protocol. *)
(*************************************************************************)
(***************************************************************************)
(* The two theorems in this module have been checked with TLC for six *)
(* RMs, a configuration with 50816 reachable states, in a little over a *)
(* minute on a 1 GHz PC. *)
(***************************************************************************)

=============================================================================
\* Modification History
\* Last modified Fri Dec 10 11:31:27 CST 2021 by thegloves
\* Created Fri Dec 10 11:31:00 CST 2021 by thegloves

Paxos

  • finding fault-tolerant distributed algorithmns is hard. They’re easy to get wrong, and hard to find errors by tesing