source: http://www.cogs.susx.ac.uk/courses/dist-sys/node1.html
Main Points - What we are going to cover
- What distributed systems are
- What characteristics well-designed systems have
What are we going to do?
- 1 lecture on computer communications and networks and loosely
coupled systems, revising material from the Multimedia
Communications Technology course. The Postgraduate course will get
an additional two hours in their seminar slot.
- 5 lectures on Remote Procedure Call, distributed objects and
classic distributed systems such as NFS
- 3 lectures on integrating distributed systems through the web,
content distribution networks and Peer to Peer computing.
- 5 lectures on closely coupled systems for distributed
transactions
- 1 lecture on recent advances in networks and distributed
systems
2.1.2 What's a Distributed System
A distributed system: physically separate computers working
together
- Cheaper and easier to build lots of simple computers
- Easier to add power incrementally
- Machines may necessarily be remote, but the system should still work as a whole
- Higher availability - one computer crashes, others carry on
working
- Better reliability - store data in multiple locations
- More security - each piece easier to secure to right level.
In real life, can get:
- Worse availability - every machine must be up. ``A distributed
system is one where some machine you've never heard of fails and
prevents you from working''
- Worse reliability
- Worse security
Problem: Coordination is more difficult because multiple
people involved, and communication is over network.
Your Task: What are the distributed interactions when you
login at an x-terminal?
Simplified interactions
Electronic Mail: Mail delivered to remote mailbox. Requires
global name space to identify users, transport mechanisms to get
mail to mailbox.
Distributed Information - WWW: Remote information hidden below a
hypertext browser. Caching and other features operate transparently.
Distributed File System: Files stored on many machines,
generally not the machine you're working on. Files accessed
transparently, the OS knowing they're remote and doing remote
operations on them such as read and write, e.g. Network File System
(NFS).
Trading Floor System: Bids made, stocks sold, screens updated.
2.1.3.1 Network assumptions
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
(Source: The Eight Fallacies of Distributed Computing - Peter Deutsch)
- Resource Sharing
- Openness
- Concurrency
- Scalability
- Fault Tolerance
- Transparency
Your Task: order the importance of each of these features for the
example systems in the previous slide.
Another way to view a system...
- Communications system
- Messages
- Machines
- Processes on Machines
- Programs
- People
- It's difficult to design a good distributed system: there are a
lot of problems in getting ``good'' characteristics, not the least of
which is people
- Over the next ten weeks you will gain some insight into how to
design a good distributed system.
Main Points - Bits and Bytes
- Kilo, Mega, Giga, Tera
- Memory, Packets and Bit Manipulation
- All data in computers is held as a sequence of ones and zeros.
- A number is represented as the base 2 number held in some
pre-ordained fixed number of bits
- Almost all other data is represented as various sets of
numbers.
- A byte is 8 bits sequenced together - what is the
maximum number?
- Integers (in Java and most other languages nowadays)
are 32 bits long - what is the maximum number?
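The questions above can be checked with a quick sketch (Python used for illustration): the maximum unsigned value in n bits is 2^n - 1, and Java's int is signed, so one bit holds the sign.

```python
# Maximum unsigned value representable in n bits is 2**n - 1.
def max_unsigned(n_bits):
    return 2 ** n_bits - 1

print(max_unsigned(8))    # a byte: 255
print(max_unsigned(32))   # 32 bits, unsigned: 4294967295
# Java's int is signed, so one bit is the sign bit:
print(2 ** 31 - 1)        # 2147483647
```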
- In communications we often talk about throughput in
bits/second, or moving files of some particular size. We use
magnitude prefixes for convenience
kilo: 10^3 x
mega: 10^6 x
giga: 10^9 x
tera: 10^12 x
- There is often confusion as to whether a kilobyte is 1000 or
1024 bytes. When dealing with processor architectures, it's
generally 1024. When dealing with communications, it's
generally 1000. State assumptions if it is not obvious from
context.
2.2.3 Memory and Packets
- A computer stores data as bits in memory.
- When it wants to send this data to another computer, it copies
the bits into the memory of the communications device.
- The communications device sends the bits through the network
to the other machine (we'll cover the details of this in the
coming week).
- The other machine's communication device places the bits
into memory which the other machine can access.
- The tricky bits come in ensuring that both machines
interpret the bits correctly.
2.2.4 Bit Manipulation
- Not only must the data be sent, but also accompanying information
allowing the computers to interpret the context of the data.
- Communications software must be able to pick out arbitrary bits
from an opaque bit sequence and interpret their meaning.
- We do this using bitwise operations - and, or, exclusive or,
negation.
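As an illustration, a hypothetical 16-bit header word (the field layout here is invented for the example) can be unpicked with shifts and bitwise ands:

```python
# Sketch: pull fields out of a 16-bit header word using shifts and masks.
# Hypothetical layout: top 4 bits = version, next 4 = type, low 8 = length.
header = 0b0100_0011_0001_0100  # version 4, type 3, length 20

version = (header >> 12) & 0xF   # shift down, mask off 4 bits
msg_type = (header >> 8) & 0xF
length = header & 0xFF

print(version, msg_type, length)  # 4 3 20
```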
Aim: How do we send messages between computers across a network?
Physical Concepts: Bandwidth, Latency
Packet Concepts: Data, Headers
Routing Concepts: Shared Media, Switches and Routing Tables,
Routing Protocols
For more detailed notes, see Dan Chalmers'
Multimedia Communications Technology notes in
http://www.informatics.sussex.ac.uk/courses/mct/.
2.3.1 Physical Concepts
What is a message?
- A piece of information which needs to move from one process to
another, eg a request to open a file, an email message, the
results of a remote print request.
For the network, this is just a sequence of bits in memory.
Need to communicate this sequence of bits across communications
system to other host.
Signal to other side whether each bit in sequence is
0 or 1.
To communicate, each end needs access to a common substrate; the
sender then changes the value of a physical characteristic
of the substrate.
Examples - voltage level of a piece of wire, light level in optical
fibre, frequency of radio wave in air
If physical characteristic has two levels then signal is binary.
If it has more than two levels, a number of bits can be encoded in each
level.
If 4 levels, then level 0 = 00, level 1 = 01, level 2 = 10, level 3 =
11
If 8 levels, how many bits can be encoded?
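The relationship between levels and bits per level can be sketched in a couple of lines (a small Python illustration):

```python
import math

# Each signalling level can encode log2(levels) bits.
def bits_per_level(levels):
    return int(math.log2(levels))

print(bits_per_level(2))  # 1 bit
print(bits_per_level(4))  # 2 bits
print(bits_per_level(8))  # 3 bits
```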
Adjusting the level at one end, and recognising the level has changed
at the other takes finite time.
The finite time limits the speed at which bits can be signalled.
The rate of signalling is the bandwidth, measured in bits per second.
If it takes 20 microseconds to raise the voltage on a wire, and 10
milliseconds to recognise the new level, what is the possible
bandwidth if there are two levels? Eight levels?
Can we get infinite bandwidth by increasing the number of levels?
No, since noise makes it impossible to correctly distinguish level.
Noise is random changes in the physical characteristic to which all
physical phenomena are prone.
Always some probability that level will be misinterpreted.
Always some probability of error in message.
The goal of communications engineers is to make this probability as low
as necessary.
Does signal propagate along wire infinitely fast?
No, limit to speed of propagation of light. Hence described
as propagation delay.
Latency is time taken for signal to travel from one end of
communication system to destination.
Since communication system may reconstruct message at intermediate
points, time taken to reconstruct message is also part of latency.
Known as switching delay
Your Task: Describe the bandwidth, propagation delay and
switching delay in a game of chinese whispers.
2.3.2 Packets
If there is an error in a message, how is it detected and rectified?
- Compute a checksum over the message
- send the checksum with the message
- Calculate a new checksum over the received message
- Compare the checksums - if different, then message is in error
- Ask for the message to be resent
Probability of error rises with the length of the message.
Thus messages are sent in separate lumps with a maximum number of bits
per lump, known as packets.
If the message fits in one packet, good. Otherwise message is in many
packets.
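The detect-and-resend steps above can be sketched with a toy additive checksum; real networks use stronger codes such as CRCs, and the packet format here is invented for illustration:

```python
# Toy error-detection sketch: a 16-bit additive checksum over packet bytes.
def checksum(data):
    return sum(data) & 0xFFFF

def send(data):
    # Append the checksum to the message before transmission.
    return data + checksum(data).to_bytes(2, "big")

def receive(packet):
    # Recompute over the received data and compare against the sent checksum.
    data, received = packet[:-2], int.from_bytes(packet[-2:], "big")
    return data if checksum(data) == received else None  # None => ask for resend

pkt = send(b"hello")
assert receive(pkt) == b"hello"        # intact packet accepted
corrupted = b"jello" + pkt[-2:]
assert receive(corrupted) is None      # error detected, request a resend
```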
2.3.2.1 Addressing
How do we direct the packet to correct recipient(s)?
- Put header bits on front of packet, analogous to address and
other information on envelope in postal system. Add source of
packet to allow returns
- Destination | Source | Packet body
- If destination and source share the same physical medium, then
destination can listen for its own address (ethernet, wireless,
single point to point link)
- If they don't share the same LAN, we use routing
Name: Identifier of object, eg Ian Wakeman, Mum
Address: Location of object, eg Rm 4C6, COGS
Route: How to get there from here - ``turn left, first right and
up stairs, first door on left''
Internet: Name is a Domain Name, such as www.cogs.susx.ac.uk. More
later.
To get packet through network, turn name into address by asking Domain
Name Service (DNS).
Place address in packet header and hand to nearest router.
Router locates MAC address corresponding to IP address, if they're on
the same LAN.
2.3.2.3 Indirect Addressing
2.3.2.4 Architecture of a switch
A switch is a specialised computer. Can turn a PC into a router,
using free software.
- A packet arrives on a link
- The switch gets the destination of the packet from the packet
header
- The switch looks up the destination in a routing table in
memory and discovers the output link
- The switch queues the packet to be sent on the output link
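The four steps above can be sketched as follows; the addresses, link names and table contents are made up for illustration:

```python
# Sketch of a switch's forwarding loop. Entries are invented.
routing_table = {
    "10.0.1.0": "link-A",
    "10.0.2.0": "link-B",
}
DEFAULT_LINK = "link-up"

output_queues = {"link-A": [], "link-B": [], "link-up": []}

def forward(packet):
    destination = packet["dst"]                          # read the header
    link = routing_table.get(destination, DEFAULT_LINK)  # routing table lookup
    output_queues[link].append(packet)                   # queue on output link

forward({"dst": "10.0.2.0", "body": b"..."})
print(len(output_queues["link-B"]))  # 1
```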
2.3.2.5 Distributed Routing Table Maintenance
The situation:
- People are network administrators (and end users)
- Communication systems are links (possibly multipoint)
- Machines are switches
- Messages may be lost
The problem: Given an address, how does a router know which output
link to send the packet on?
Choices:
- Packet could contain list of all output links - source routing.
Requires source to lookup and determine route. May be good thing.
- Router could look up address in local routing table, and send out of
corresponding link. If not in table, send out default link.
How do we construct tables? Could install entries by hand, known as
static routing.
But
- Limited to entries people know about.
- Unable to ensure consistency and absence of loops
- Unable to respond to changing topology, sharks gnawing through undersea
cables etc.
- Internet has no centralised authority
So we use distributed routing algorithm
- Each switch knows its own address
- Each link has a ``cost'', such as a value of 1 per link, or a measure
of delay.
- Each switch starts out with distance vector, consisting of 0 for
itself, and infinity for everyone else
- Switches exchange distance vectors with neighbouring switches,
re-sending whenever their info changes
- Switch saves most recent vector from neighbours
- Switch calculates own distance vector by examining cost from
neighbour and adding cost of link to neighbour
- Use link with minimum cost to destination as link to route out.
Examples include RIP and BGP.
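One round of the distance-vector calculation above might look like this sketch; the switch names, topology and costs are invented:

```python
# One distance-vector recomputation: cost to each destination is the
# minimum over neighbours of (link cost to neighbour + neighbour's cost).
INF = float("inf")

def recompute(my_name, neighbour_link_costs, neighbour_vectors, destinations):
    vector = {}
    for dest in destinations:
        if dest == my_name:
            vector[dest] = 0
            continue
        vector[dest] = min(
            (cost + neighbour_vectors[n].get(dest, INF)
             for n, cost in neighbour_link_costs.items()),
            default=INF,
        )
    return vector

# Switch A has neighbours B (link cost 1) and C (link cost 4).
vec = recompute(
    "A",
    {"B": 1, "C": 4},
    {"B": {"B": 0, "C": 2, "D": 5}, "C": {"B": 2, "C": 0, "D": 1}},
    ["A", "B", "C", "D"],
)
print(vec)  # {'A': 0, 'B': 1, 'C': 3, 'D': 5}
```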
- Each switch knows addresses that are direct neighbours
- Switch constructs packets saying who are neighbours - link
state packets.
- Link state packets flooded to all other switches
- Switch constructs complete graph using most recent link state
packets from all other switches
- Use Dijkstra shortest path to figure out routing table.
Examples include OSPF.
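A minimal sketch of the link-state computation: given the complete graph assembled from link state packets, run Dijkstra's shortest path. The topology here is invented:

```python
import heapq

# Dijkstra's shortest path over the full graph a link-state switch holds.
def shortest_paths(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

graph = {  # made-up topology: switch -> {neighbour: link cost}
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(shortest_paths(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```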
2.3.3 Conclusion: Network Properties
Packet Switched Networks present certain fundamental problems to the distributed systems programmer:
- While switches are converging on a consistent set of routes, packets
will bounce around in the network, and suffer delays or get dropped.
- Switches are shared with other packets, therefore queues form.
- Total latency is therefore variable.
- Loss rate is variable (noise, available switch buffers).
- Queue size management (aka congestion control) changes the bandwidth available to machines. Therefore bandwidth is variable.
Goal: Prevent Misuse of computers
- Definitions
- Authentication
- Private and Public Key Encryption
- Access Control and Capabilities
- Enforcement of security policies
- Examples of Security Problems
Types of Misuse
Protection is to prevent either accidental or intentional misuse
Security is to prevent intentional misuse
Three pieces to security
AuthenticationWho user is
AuthorisationWho is allowed to do what
EnforcementEnsure that people only do what they are allowed to do
A loophole in any of these can cause problems, eg
1. Log in as super-user
2. Log in as anyone, do anything
3. Can you trust software to make decisions about 1 and 2?
Common approach: Passwords. Shared secret between two parties. Since
only I know password, machine can assume it is me.
Problem 1: the system must keep a copy of the secret, to check the typed
password against. What if a malicious user gains access to this list of
passwords?
Encryption: a transformation on data that is difficult to
reverse - in particular, secure digest functions.
2.8.2.1 Secure Digest Functions
A secure digest function h = H(M) has the following properties:
- Given M, it is easy to compute h.
- Given h, it is hard to compute M.
- Given M, it is hard to compute M' such that H(M) = H(M')
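These properties can be illustrated with SHA-256 from Python's standard library: computing the digest is trivial, while inverting it or finding a colliding input is computationally infeasible, so the best known attack is brute-force guessing.

```python
import hashlib

# Easy direction: compute h = H(M).
h = hashlib.sha256(b"my secret password").hexdigest()
print(h[:16])  # a short prefix of the 64-hex-character digest

# Hard direction: recovering M from h, or finding M' with the same
# digest, reduces in practice to guessing inputs one by one.
assert hashlib.sha256(b"guess").hexdigest() != h
```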
For example: the Unix /etc/passwd file.
Password -> one way transform -> encrypted password
The system stores only the encrypted version, so it's ok if someone reads
the file. When you type in your password, the system encrypts it and
compares against the stored encrypted versions.
Over the years, password protection has evolved from DES, using a
well-known string as the input data and the password as the key,
through MD5 to SHA-1.
Passwords must be long and obscure.
Paradox: short passwords are easy to crack, but long
ones, people write down
Improving technology means we have to use longer passwords. Consider
that Unix initially required only lowercase 5 letter passwords.
How long for an exhaustive search? 26^5 ≈ 12,000,000 passwords.
- In 1975, 10 ms to check a password -> about 1 day
- In 2003, 0.00001 ms to check a password -> about 0.1 seconds
Most people choose even simpler passwords such as English words - it takes
even less time to check for all words in a dictionary.
- Extend everyone's password with a unique number (stored in
passwd file). Can't crack multiple passwords at a time.
- Require more complex passwords, eg 7 characters with lower, upper,
number and special: 70^7 ≈ 8000 billion, or about 1 day at 2003
speeds. But people pick common patterns eg 6 lower case plus number.
- Make it take a long time to check each password. For example,
delay every login attempt by 1 second.
- Assign long passwords. Give everyone a calculator or smart card
to carry around to remember password, with PIN to activate. Need
physical theft to steal card.
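The first and third countermeasures above (a unique per-user salt, and making each check deliberately slow) can be sketched with PBKDF2 from Python's standard library; the iteration count is illustrative:

```python
import hashlib
import os

# Sketch of salted, deliberately-slow password storage.
def store(password):
    salt = os.urandom(16)  # unique per user, defeats cracking many at once
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest    # what goes in the password file

def check(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return candidate == digest  # use a constant-time compare in real code

salt, digest = store("correct horse")
assert check("correct horse", salt, digest)
assert not check("wrong guess", salt, digest)
```

The 100,000 iterations play the same role as the 1-second login delay: each guess costs the attacker real time.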
Problem 3: Can you trust the encryption algorithm? Recent
example: techniques thought to be safe, such as DES, have back doors
(intentional or unintentional). If there is a back door, you don't
need to do a complete exhaustive search.
In distributed systems, the network between the machine on which the
password is typed and the machine the password is authenticating on is
accessible to everyone.
Two roles for encryption:
- Authentication - do we share the same secret?
- Secrecy - I don't want anyone to know this data (eg medical
records)
Use an encryption algorithm that can easily be reversed given the
correct key, but is difficult to reverse without the key.

- From cipher text, can't decode without password.
- From plain text and cipher text, can't derive password.
- As long as the password stays secret, we get both secrecy and
authentication.
2.8.3.1 Symmetric Encryption
Symmetric encryptions use exclusive ors, addition, multiplication,
shifts and transpositions, all of which are fast on modern processors.
DES: The Data Encryption Standard (DES) was released in 1977.
The 56 bit key is too weak for most uses now; instead triple DES
(3DES) is used, which has 168 bit keys (three 56 bit DES keys).
IDEA: The International Data Encryption Algorithm has 128 bit
keys, and has proven strong against a large body of analysis.
AES: The US-based NIST has defined a new encryption standard
based on Rijndael, which offers 128, 192 or 256 bit keys.
The server keeps a list of passwords, and provides a way for two parties,
A and B, to talk to each other, as long as they trust the server.
Notation: Kxy is a key for talking between x and y. (..)K means
encrypt message (..) with key K.
Simplistic overview:
- A asks server for key:
  A -> S: (Hi! I'd like a key for A,B)
- Server gives back special ``session'' key encrypted in A's key, along
with a ticket to give to B:
  S -> A: (Use Kab, (This is A! Use Kab)Ksb)Ksa
- A gives B the ticket:
  A -> B: (This is A! Use Kab)Ksb
Details: add in timestamps to limit how long each key exists, use
single use authenticators so that clients are sure of server
identities, and to prevent machine replaying messages later.
Also have to include encrypted checksums to prevent malicious user
inserting garbage into message.
Public key encryption is a much slower alternative to private key;
separates authentication from secrecy.
Each key is a pair K, K^-1
With private key: ((text)K)K^-1 = text
With public key: ((text)K^-1)K = text
Can't derive K from K^-1, or vice versa
2.8.4.1 Public key directory
Idea: K is kept secret, K^-1 made public, such as in a public directory.
For example:
(I'm Ian)K - everyone can read it, but only I could have sent it
(authentication).
(Hi!)K^-1 - anyone can send it, but only I can read it (secrecy).
((I'm Ian)K Hi!)K'^-1 - on first glance, only I can send it, only you
can read it.
What's wrong with this assumption?
Problem: How do you trust the directory of public keys? Maybe
somebody lied to you in giving you a key.
2.8.5 Secure Socket Layer
- Provides a technique for data sent over a TCP connection to be
encrypted.
- Uses public key technology to agree upon a key, then 3DES or
whatever to encrypt the session.
- Data encrypted in blocks, optionally with compression first.
- Used in http as https, and for telnet, ftp etc.
2.8.5.1 SSL handshake protocol

Who can do what...
Access control matrix: formalisation of all the permissions in the
system
objects | file1 | file2 | file3 | ...
users:  |       |       |       |
A       | rw    | r     |       |
B       |       | rw    |       |
C       |       |       | r     |
...     |       |       |       |
For example, one box represents that C can read file3.
Potentially huge number of users and objects, so impractical to store
all of these.
- Access control list - store all permissions for all users with
each object
Still might be a large number of users. Unix addresses this by having
rwx permissions for user, group and world. Recent systems provide a way
of specifying groups of users and permissions for each group
- Capability list - each process stores tickets for the objects it
can operate on.
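The two representations can be sketched as follows, using the example matrix above; the data structures are illustrative:

```python
# Access control list: permissions stored with each object.
acl = {
    "file1": {"A": {"r", "w"}},
    "file2": {"A": {"r"}, "B": {"r", "w"}},
    "file3": {"C": {"r"}},
}

def acl_allows(user, obj, op):
    return op in acl.get(obj, {}).get(user, set())

# Capability list: tickets stored with each process/user.
capabilities = {"C": {("file3", "r")}}

def cap_allows(user, obj, op):
    return (obj, op) in capabilities.get(user, set())

assert acl_allows("A", "file1", "w")
assert not acl_allows("C", "file1", "r")
assert cap_allows("C", "file3", "r")
```

The same matrix is stored column-by-column (ACLs) or row-by-row (capabilities); which is cheaper depends on whether objects or users dominate.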
2.8.6.2 Digital Rights Management
- Digital content is produced by people earning a living, and they
wish to protect their investment
- Digital Rights Management is the catchall term for the
technologies controlling the use of digital content.
- Two main approaches:
Containment: the content is wrapped in an encrypted
shell, so that users have to prove their capability of
using the content through knowledge of the key.
Watermarking: the content is marked so that devices know
that the data is protected.
- The problem is how to enforce the checking of
capabilities and enforce the no circumvention requirement.
The enforcer is the program that checks passwords, access control lists,
etc.
Any bug in the enforcer means a way for a malicious user to gain the
ability to do anything.
In Unix, superuser has all the powers of the Unix kernel - can do
anything. Because of coarse-grained access control, lots of stuff has
to run as superuser to work. If bug in any program, you're in
trouble.
Paradox:
- make enforcer as small as possible - easier to get correct, but
simple minded protection model (more programs have high privilege).
- fancy protection - only minimal privilege, but hard to get
right.
2.8.8 Trusted Computing Platform
Question: How do we ensure no software transgresses digital
rights?
Answer: By only allowing approved software to have access to
the data.
- The basic computer and operating system must be running
validated software.
- There must be an untamperable part of the computer/OS that can
hold keys and hand over only to validated software.
- The untamperable part of the computer is the Fritz chip, a
co-processor which holds a unique certificate that it is running
some validated code
- Use of names
- Structure of names
- Name Services
- Domain Name Service (DNS) - The Internet Name Service
Definitions:
Name - what it's called
Address - where it is
Route - how to get there
Object Identification: a service or resource we want to use, eg
a filename, a telecommunications provider, a person
Allow Sharing: communicating processes can pass names and thus
share resources
Location Independence: if we separate the name from the address,
we can migrate the object transparently
Security: if there is a large number of possible names, knowing the
name of an object means that it must explicitly have been passed. If
the entire system is constructed so that names are passed with
authorisation, then knowing a name implies a chain of trust allowing
access to the object.
- Use as arguments to functions, eg to call a service
- Create names. When an object comes into existence it must be
named. The name should meet the rules of the system, eg be unique;
therefore a naming authority (possibly the object itself) must keep
track of what names it has allocated
- Delete names. When an object disappears from the system, at
some point may wish to delete name and allow re-use.
- Part of the design space for any distributed system is how to
construct the name.
- Choice of design has ramifications for design of rest of system
and performance of service.
- Unique Identifier (UID) aka flat names, primitive names.
- No internal structure, just string of bits.
- Only use is comparison against other UIDs eg for lookup in table
containing information on named object.
- Provide location independence and uniformity
- But difficult to name different versions of the same
object, eg Distributed Systems edition 1 vs Distributed Systems
edition 2. Real objects have relationships to each other - useful
to reflect these in naming practice
- Difficult to discover address from name - need to search entire
name space
Add partition structure to name to enable some function to be
more efficient, typically location and name allocation

- Domain Name Service (DNS) name - unix.rn.informatics.scitech.sussex.ac.uk.
Each part of name comes from flat name space. Division of name space
and possible objects into smaller space. "uk" reduces objects to
those within the uk, "ac" to those administered by academic
networks, and so on.
- When allocating names, simply add to lowest part of partition.
Low risk of collision, since small number of other objects, all
administered by same authority
- When looking up name, can guess where information will reside.
- Necessary to have a unique name.
- But useful to name objects in different ways, such as by service
they offer eg Postman Pat, John of Gwent, www.informatics.sussex.ac.uk.
- Create name using attributes of object.
- Note that objects can therefore have multiple names, not all of
which are unique
- Choice of name structure depends on system. DNS chooses
partition according to administration of creation, with aliases to
allow naming by service eg ftp.informatics.sussex.ac.uk is also
doc-sun.rn.informatics.scitech.sussex.ac.uk
- Ask all possible objects within the system if they respond to
that name. Can be efficient if network supports broadcast eg
Ethernet and other LANs.
- Equivalent to distributing name table across all objects,
objects storing names referring to them.
- Only want positive responses - all responses would generate a
lot of traffic.
- Scaling problem when move into wide area.
- Greater number of hosts imply greater probability of failure
- Broadcasts consume higher proportion of bandwidth, made worse
due to higher failures needing more location requests
- Broadcast used only on small LAN based systems (and in initial
location of directory)
Parts of the name table are distributed to different servers eg. in
Domain Name Service, servers are allocated portions of the name space
below certain domains such as the root, ac.uk, susx.ac.uk
Names can be partitioned between servers based on
Algorithmic clustering: eg apply a well-known hash function on the name
to map it to a server. May result in the server being remote from the
object. The only technique for UIDs.
Structural clustering: if the name is structured, use the structure to
designate names to a particular server, such as in DNS.
Attribute clustering: if names are constructed using attributes, then
servers take responsibility for certain attributes.
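A sketch of algorithmic clustering: a well-known hash function maps each name to one of N servers, so every client computes the same mapping independently. The server names are invented:

```python
import hashlib

# Map any name (including an unstructured UID) to a server deterministically.
SERVERS = ["ns0", "ns1", "ns2", "ns3"]

def server_for(name):
    digest = hashlib.sha256(name.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

print(server_for("www.informatics.sussex.ac.uk"))  # one of ns0..ns3
```

Note the drawback mentioned above: the chosen server may be nowhere near the object it describes.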
If a single server is responsible for name space, there is still single
point of failure, and a performance bottleneck.
Most systems therefore
- Replicate name list to other servers. Also increases
performance for heavily accessed parts of name space eg secondary
servers in DNS
- Cache information received by lookup. No need to repeat lookup
if asked for same information again. Increases performance.
Implemented in both client side and server (in recursive calls) in
DNS.
If information is cached, how do we know when it's invalid? We may
attempt to use inconsistent information.
Alleviated by following features of some distributed systems
- In most systems objects change slowly, so names live for a long
time, and are created infrequently
- If the address of an object is wrong, it causes an error. The user of
the address can recover if it assumes that inconsistent information is
one of the possible problems.
- Obsolete information can be fixed by the addressed object leaving
redirection pointers. Equivalent to leaving a forwarding address for a new home.
However, there are always systems which break these assumptions eg
highly dynamic distributed object system, creating lots of objects and
names and deleting lots of names.
- Hidden behind RPC interface.
- Client knows of server to ask (either installed in file, or
through broadcast location).
- Client calls with arguments of name and required attribute, eg
address. In DNS arguments are name and the type of requested
attribute.
- Server will return result with either required attribute or
error message
If the name is not stored on the server, it may be stored on another
server. Two options:
Recursive: the server asks another possible server about the name and
attribute, which may then have to ask another server, and so on.
Iterative: the server returns the address of another possible server to
the client, who then resends the request to the new server.
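The iterative option can be sketched as follows; the server contents and the address returned are invented for illustration:

```python
# Iterative lookup sketch: each server either answers, or refers the
# client to a server responsible for a longer suffix of the name.
servers = {
    "root": {"referral": {"ac.uk": "uk-server"}},
    "uk-server": {"answer": {"www.sussex.ac.uk": "139.184.0.1"}},  # made up
}

def iterative_lookup(name, server="root"):
    while True:
        info = servers[server]
        if name in info.get("answer", {}):
            return info["answer"][name]      # required attribute found
        for suffix, next_server in info.get("referral", {}).items():
            if name.endswith(suffix):
                server = next_server         # client re-sends to new server
                break
        else:
            return None                      # error message

print(iterative_lookup("www.sussex.ac.uk"))  # 139.184.0.1
```

In the recursive option the same chain of requests happens, but server-to-server rather than driven by the client.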
- Names can be flat, or they can be structured
- Centralised name servers suffer from availability - distributed
name servers suffer from inconsistency
- Interaction with name server best modelled by RPC
A Distributed File System provides transparent access to files
stored on a remote disk
Themes:
- Failures - what happens when server crashes but client doesn't?
or vice versa
- Performance
Caching - use caching at both the
clients and the server to improve performance
- Cache Coherence - how do we make sure each client sees most up
to date copy of file?
- Request for operation on file goes to OS.
- OS recognises file is remote and constructs RPC to remote file server
- Server receives call, does operation and returns results.
- Subtrees in directory structure generally map to file system.
Provides access transparency

- Simple approach: Use RPC to forward each file system request to remote
server (older versions of Novell Netware).
- Example operations: open, seek, read, write, close
- Server implements operations as it would for local request and passes
result back to client

Advantage: the server provides a consistent view of the file system to
both A and B.
Disadvantage: can be lousy performance
- Going over network is slower than going through memory
- Lots of network traffic - congestion
- Server can be a bottleneck - what if there are lots of clients?
Main idea - use caching to reduce network load
- Cache file blocks, file headers etc in both client and
server memory.
- More recent NFS implementations use a disk cache at the client
as well.

Issues: failure and cache consistency
What if server crashes? Can client wait until server comes back up
and continue as before?
- Any data in server memory but not yet on disk can be lost
- If there is shared state across RPCs, eg open seek read. What
if server crashes after seek? Then when client does ``read'' it will
fail.
- Message re-tries - suppose server crashes after it does ``rm
foo'' but before it sends acknowledgement? Message system will
retry and send it again. How does system know not to delete it
again (someone else may have created new ``foo''). Could use
transactions, but NFS takes more ad hoc approach.
What if client crashes?
- Might lose modified data in client cache.
- Write-through caching - when a file is modified, all modified
blocks are sent immediately to the server disk. To the client,
``write'' doesn't return until all bytes are stored on disk.
- Stateless Protocol - server keeps no state about client, except
as hints to improve performance
- Each read request gives enough information to do entire operation -
ReadAt(inumber, position) not Read(openFile).
- When server crashes and restarts, can start processing requests
immediately, as if nothing had happened.
- Operations are idempotent: all requests are ok to repeat
(all requests are done at least once). So if server crashes between
disk I/O and message send, client can resend message and server
does request over again
- Read and write file blocks are easy - just re-read or re-write file
block, so no side-effects.
- What about ``remove''? NFS just ignores this - does the remove twice
and returns error if file not found, application breaks if
inadvertently removes other file.
- Failures are transparent to client system.
Is this a good idea? What should happen if the server crashes? If an
application is in the middle of reading a file when the server crashes, options:
- Hang until server returns (next week...)
- return an error? But networked file service is transparent.
Application doesn't know network is there. Many Unix Apps ignore
errors and crash if there is a problem.
NFS has both options - the administrator can select which one.
Usually hang and return error if absolutely necessary.
- What if multiple clients sharing same files? Easy if they are both
reading - each gets a local copy in their cache.
- What if one writing? How do updates happen?
- Note NFS has write-through cache policy. If one client modifies file,
writes through to server.
- How does other client find out about change?
- In NFS, client polls server periodically to check if file has changed.
Polls server if data hasn't been checked in every 3-30 seconds (Exact
time is tunable parameter)

- When file changed on one client, server is notified, but other clients
use old version of file till timeout. They then check server and get
new version.
- What if multiple clients write to same file? In NFS get either
version or mixed version. Completely arbitrary!
- What should happen? If one CPU changes file, and before it completes,
another CPU reads file?
- We want isolation between operations, so read
should get old file if it completes before write starts, new version if it
starts after write completes. Either all new or all old any other
way cf serialisability.
- it's simple
- it's highly portable
- but it's sometimes inconsistent
- and it doesn't scale to large numbers of clients
Note that this describes NFS version 3.
AFS (CMU late 80s)

DCE DFS (commercial product)
- Callbacks: Server records who has copies of file
- Write through on close
- If file changes, server is updated (on close)
- Server then immediately tells all those with old copy
Session semantics - updates only visible on close
- In Unix (single machine), updates visible immediately to other
programs who have file open.
- In AFS, everyone who has file sees old version, anyone who opens file
again will see new version.
In AFS:
- on open and cache miss: get file from server, set up callback
- on write close: send copy to server; tells all clients with
copies to fetch new version from server on next open
- Files cached on local disk; NFS (used to) cache only in memory

- What if server crashes? Lose all your callback state.
- Reconstruct information from clients - go and ask everyone ``who has which
files cached''
- Disk as cache: more files cached locally
- Callbacks: server not involved if file is
read-only (the majority of file accesses are read-only)
- But on fast LANs, local disk is slower than remote memory
NFS version 4 will provide session semantics when it is
deployed by vendors in the 2005 timeframe.
- Remote file access needs caching to get decent performance.
- Central server is a bottleneck
- Performance bottleneck:
- All data is written through to server
- all cache misses go to server
- Availability Bottleneck
- Server is single point of failure
- Cost bottleneck
- Server machines are high cost relative to workstations
2.12 Content Distribution Networks
Main Points
- Building content caches
- Pre-fetching data
- Using your neighbours - BitTorrent
2.12.1 Getting Content over a Network

- Users want to download content from servers as quickly as possible
- What structures can we use to improve their experience, and the
impact upon the network?
2.12.2 Web Caches

- Large organisations can improve web performance by sticking a
web cache in front of HTTP connections
- The cache inspects an incoming HTTP request to see if it can
satisfy the request from locally cached objects.
- If yes, then the object is returned, and the last reference time
is updated
- If no, then the object is retrieved and copied locally if allowed.
- Web objects can be marked as non-cacheable.
- Caching is a major part of the HTTP standard
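The lookup logic above can be sketched as follows. This is a toy model with invented names: real caches also honour expiry times and validators from the HTTP headers, but the shape of the decision is the same.

```python
class WebCache:
    """Minimal forward proxy cache: serve cached objects, fetch and store the rest."""
    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch   # function url -> (body, cacheable flag)
        self.store = {}                    # url -> (body, last_reference_time)
        self.hits = self.misses = 0

    def get(self, url, now):
        if url in self.store:
            body, _ = self.store[url]
            self.store[url] = (body, now)  # update the last reference time
            self.hits += 1
            return body
        body, cacheable = self.origin_fetch(url)
        self.misses += 1
        if cacheable:                      # objects may be marked non-cacheable
            self.store[url] = (body, now)
        return body
```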
2.12.2.1 Cache Performance
- The performance of a web cache is difficult to model, since it
emerges from a complex interaction between TCP, HTTP
and the content.
- Caches work because of temporal locality, due to popularity of
content, and spatial locality, due to structure of HTML documents
- Measurements of web proxies give the following results (based on
JANET caches)
- Request hit rate is about 60%.
- Volume hit rate is about 30%.
- Latency improvement is around a factor of 3 on average
2.12.2.2 Problems with Caching
- Not all content is marked as cacheable, eg because the site
wants accurate records of who looks at content.
- All hosts behind a cache appear to come from one
address.
Question: Why is this a problem?
2.12.3 Pre-fetching Data
- Can we improve performance by pro-actively distributing content
to caches?
- Yes...
2.12.3.1 Active Content Distribution
- The HTML uses links to the CDN domain name, eg akamai.com
- The Internet service provider has entries in its local DNS for
akamai.com pointing to a machine on the ISP's network.
- Content will therefore be supplied from the local machine rather
than the original machine
- Customer interaction improved.
- Bandwidth requirements of servers reduced
2.12.4 Using your Peers: BitTorrent

- Why not use the other people receiving the content?
- BitTorrent downloads from other machines
- Basic Idea:
- To download, the host contacts a machine tracking those already
downloading the torrent.
- Peers are selected at random from the tracker.
- Pieces are selected to download from those available on the
downloader's peer set until all pieces of the file have been received
2.12.4.1 The .torrent file
- To start downloading from a torrent, the consumer must
first locate a .torrent file.
- This contains the file length, name and hashes of the file
pieces, and the URL of a tracker
- The file is split into 250 KByte pieces, each having a SHA1 hash
calculated.
- A tracker holds the IP addresses of current downloaders
2.12.4.2 Locating peers
- After receiving the torrent file, the downloader contacts the
tracker
- The tracker inserts the downloader into its list of downloaders,
and returns a random list of other downloaders
- This list becomes the downloader's peers
- Question What is the shape of the overlay graph?
2.12.4.3 Choosing Pieces
- The downloader will contact its peers to discover what pieces
they have.
- It then chooses a piece to download:
- The first choice is made randomly, so as to spread load
- Subsequent pieces are based on a rarest-first approach to
increase probability all pieces are available.
- When a peer has downloaded a new piece which matches the SHA1
hash, it notifies its peers it has a new piece.
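The piece-selection policy can be sketched as follows; the helper name and data shapes are invented, not BitTorrent's actual implementation, but the random-first / rarest-first logic follows the description above.

```python
import random
from collections import Counter

def choose_piece(have, peer_pieces, first=False, rng=random):
    """Pick the next piece to request from our peer set.

    have: set of piece indices already downloaded
    peer_pieces: dict peer -> set of piece indices that peer holds
    """
    available = set().union(*peer_pieces.values()) - have
    if not available:
        return None                       # nothing left to request
    if first:                             # first piece: random, to spread load
        return rng.choice(sorted(available))
    counts = Counter()                    # how many peers hold each piece
    for pieces in peer_pieces.values():
        counts.update(pieces - have)
    return min(available, key=lambda p: counts[p])   # rarest first
```

Rarest-first keeps the number of copies of each piece in the swarm roughly balanced, so no piece dies out when peers leave.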
2.12.4.4 Choosing Downloaders
- A request to upload is accepted if the requester recently
uploaded to it
- This provides an incentive to machines to share data
- Periodically other machines are tried for upload
2.12.5 Summary
- Web caching improves performance by a reasonable factor,
dependent on situation
- Pro-active content distribution can reduce latency and improve
bandwidth usage for popular services
- BitTorrent can improve bandwidth usage by spreading load across
peers.
2.13 Replication: Availability and Consistency
- Motivation for replication
- Multicasting updates to a group of replicas
- Total Ordering
- Causal Ordering
- Techniques for ordering protocols
- ISIS CBCAST
Performance enhancement
- Single Server acts as a bottleneck - if we can balance load
amongst multiple servers, get apparent performance gain
- If clients are geographically distributed, we can site servers
near clients and reduce communication costs
Availability
Fault Tolerance: Even in the presence of failure, the service
will continue to give the correct service
- Stronger than availability, since can provide real-time
guarantees (with extra work!)
- Can protect against arbitrary failure where machines feed
wrong information (Byzantine Failure)
A collection of replicas should behave as if state was stored
at one single site
- When accessed by client, view should be consistent
- Replication should be transparent - client unaware that servers
are replicated
If we are providing a replica service, replica can be passive or
active.
Passive replicas are standbys, to maintain service on failure. No
performance improvement.
Standbys must monitor and copy state of active server
Provide availability in simple manner.
Used for highly available systems eg space applications
- Clients can modify resource on any of the replicas.
- What happens if another client requests resource before replica
has informed others of modification, as in cache consistency in
distributed file systems?
- Answer depends upon application...

- Users read and submit articles through Front End.
- Articles replicated across a number of servers
- Front Ends can connect to any server
- Servers propagate articles between themselves so that all
servers hold copies of all articles.
- User membership of a given bbs is tightly controlled.
Questions on BBS:
- How should messages be passed between replicas?
- Should order of presentation of articles to clients be the same
across all replicas? Are weaker ordering semantics possible?
- When a client leaves a bbs group, can they see articles submitted
after they have left? Is this desirable?
- What should happen when replicas are temporarily partitioned?
Clients read and update state at any of the replicated servers eg
submit messages in bbs. To
maintain consistency, three things are important
Multicast communication: Messages delivered to all servers in
the group replicating data
Ordering of messages: Updates occur in the same ``order'' at
each server
Failure recovery: When servers or the network fail, and come
back, the replicas must be able to regain consistency. Done through
Voting and Transactions (later in course)
A Process Group: a collection of processes that co-operate
towards a common goal.
Multicast communication: One message is sent to the members of a
process group
Idea: Instead of knowing address of process, just need to know an
address representing the service. Lower levels take care of routing
messages.
Useful for:
Replicated Services: One update message goes to all replicas,
which perform identical operations. Reduces communication costs.
Locating objects in distributed services: Request for object
goes to all processes implementing service, but only the process holding
the object replies.
Maintenance of group information is a complex function of the name
service (for tightly managed groups)
Create Group: Create a group identifier that is globally
unique.
Join Group: Join a group. Requires joining process information
to be disseminated to the message routing function. May require
authentication and notification of existing members.
Leave Group: Remove a process from a group. May require
authentication, may occur as a result of failure or partition. Need
to notify message routing function, may notify other members.
Member List: Supply the list of processes within a group.
Needed for reliable message delivery, may require authentication.
If two processes multicast to a group, the messages may be arbitrarily
ordered at any member of the group.
Process P1 multicasts message a to a group comprising processes P1, P2, P3 and
P4.
Process P2 multicasts message b to the same group
The order of arrival of a and b at members of the group can be different.

2.13.6.1 Ordering example

- Order of operations may be important - delete object, create
object.
- If delete object arrives before create object, then operation
not completed
2.13.6.2 Ordering Definitions
Various definitions of order with increasing complexity in
multicasting protocol
FIFO Ordering: Messages from one process are processed at all
group members in the same order
Causal Ordering: All events which preceded the message
transmission at a process precede message reception at other
processes. Events are message receptions and transmissions.
Total Ordering: Messages are processed at each group member in
the same order.
Sync Ordering: For a sync ordered message, either an event
occurred before message reception at all processes, or after the
message. Other events may be causally or totally ordered.
Achieved by process adding a sequence number to each message.
Group member orders incoming messages with respect to sequence number.
Applicable when each process state is separate, or operations don't
modify state, just add incremental updates or read.
When several messages are sent to a group, all members of the group
receive the messages in the same order.
Two techniques for implementation:
Sequencer: Elect a special sequencing node. All messages are
sent to the sequencer, which then sends messages on to the replicas. FIFO
ordering from the sequencer guarantees total ordering. Suffers from a
single point of failure (recoverable by election) and a bottleneck.
Holdback Queue: Received messages are not passed to the
application immediately, but are held in a
holdback queue until the ordering constraints are met.
The sender negotiates a largest
sequence number with all replicas.
- Replicas store largest final sequence number yet seen Fmax,
and largest proposed sequence number Pmax
- Sender sends all replicas message with temporary ID.
- Each Replica i replies with suggested sequence number of
max(Fmax, Pmax) + 1. Suggested sequence number
provisionally assigned to message and message placed in holdback
queue (ordered with smallest sequence number at front)
- Sending site chooses largest sequence number and notifies
replicas of final sequence number. Replicas replace provisional
sequence number with final sequence number.
- When item at front of queue has an agreed final sequence number,
deliver the message.
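The negotiation above can be sketched as follows. This toy version runs the exchange as direct method calls rather than real messages, and the names are invented; it shows the propose / finalise / holdback-delivery steps.

```python
class Replica:
    """One group member in the sequence-number negotiation for total ordering."""
    def __init__(self, rid):
        self.rid = rid
        self.fmax = 0          # largest final sequence number seen
        self.pmax = 0          # largest proposed sequence number
        self.holdback = {}     # msg_id -> [provisional seq, payload, final?]
        self.delivered = []

    def propose(self, msg_id, payload):
        seq = max(self.fmax, self.pmax) + 1    # suggested sequence number
        self.pmax = seq
        self.holdback[msg_id] = [seq, payload, False]
        return seq

    def finalise(self, msg_id, final_seq):
        self.holdback[msg_id][0] = final_seq   # replace provisional number
        self.holdback[msg_id][2] = True
        self.fmax = max(self.fmax, final_seq)
        self._deliver()

    def _deliver(self):
        # deliver while the smallest-numbered held-back message is final
        while self.holdback:
            msg_id = min(self.holdback, key=lambda m: self.holdback[m][0])
            seq, payload, final = self.holdback[msg_id]
            if not final:
                break
            self.delivered.append(payload)
            del self.holdback[msg_id]

def multicast(replicas, msg_id, payload):
    """Sender side: collect proposals, pick the largest, announce the final number."""
    final = max(r.propose(msg_id, payload) for r in replicas)
    for r in replicas:
        r.finalise(msg_id, final)
```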
``Causal'' here means that, since we don't know the application, we must
assume messages might be causally related.
a and b are events, generally the sending and receiving of
messages.
We define the causal relation a → b if:
- a and b are events at the same process, and a happened before b
- a is the sending of a message by process P1 and b is
the arrival of the same message at P2
In a bulletin board, an article titled ``re: Multicast Routing'' in
response to an article called ``Multicast Routing'' should always be
delivered after it, even though it may be received before the initial article
ISIS is a real commercial distributed system, based on process groups.
Causal ordering for multicast within a group is based around
Vector Timestamps
The vector VT has an entry for each member of the
group, each typically an integer.
Vector timestamps have one operation defined
merge(u, v)[k] = max(u[k], v[k]), for k = 1..n
Incoming messages are placed on a holdback queue, until all messages
which causally precede the message have been delivered.
- All processes pi initialise the vector to zero
- When pi multicasts a new message, it first increments VTi[i] by 1; it piggybacks
vt = VTi on the message
- Messages are delivered to the application in process Pj when
- The message is the next in sequence from pi i.e.
vt[i] = VTj[i] + 1
- All causally prior messages that have been delivered to pi must have been delivered to pj, i.e.
vt[k] ≤ VTj[k] for all k ≠ i.
- When a message bearing a timestamp vt is delivered to pj, pj's timestamp is updated as
VTj = merge(vt, VTj)
In words
- Incoming vector timestamp is compared to current timestamp.
- If conditions for delivery to process not met, then message placed on
holdback queue.
- When an incoming message is delivered, the timestamp is updated by the
merge.
- Examine all messages in the holdback queue to see if they can be
delivered.
- CBCAST requires reliable delivery.
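The delivery rule can be sketched as follows; this is a single-machine simulation with invented names, where a ``message'' is just a (sender, vector timestamp, payload) tuple.

```python
class CbcastProcess:
    """ISIS-style causal multicast receiver using vector timestamps."""
    def __init__(self, pid, n):
        self.pid, self.vt = pid, [0] * n
        self.holdback, self.delivered = [], []

    def send(self, payload):
        self.vt[self.pid] += 1                   # increment own entry before sending
        return (self.pid, list(self.vt), payload)

    def receive(self, msg):
        self.holdback.append(msg)
        self._try_deliver()

    def _deliverable(self, sender, vt):
        if vt[sender] != self.vt[sender] + 1:    # next in sequence from sender?
            return False
        return all(vt[k] <= self.vt[k]           # no causally prior message missing
                   for k in range(len(vt)) if k != sender)

    def _try_deliver(self):
        progress = True
        while progress:                          # re-examine holdback after delivery
            progress = False
            for msg in list(self.holdback):
                sender, vt, payload = msg
                if self._deliverable(sender, vt):
                    self.delivered.append(payload)
                    self.vt = [max(a, b) for a, b in zip(self.vt, vt)]  # merge
                    self.holdback.remove(msg)
                    progress = True
```

The bulletin-board example above plays out directly: a reply that arrives before the original is held back until the original is delivered.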

- When group membership changes, what set of messages should be
delivered to members of changed group?
- What happens to undelivered messages of failed members?
- What messages should new member get?
- ISIS solves by sending a sync ordered message announcing
that the group view has changed. Messages thus belong to a
particular group view.
- Use coordinator to decide which messages belong to which view.
- Replication of services and state increase availability
- Replication increases performance
- Replication increases Fault tolerance
- To maintain consistency, multicast updates to all replicas
- Use sequence numbers to maintain FIFO ordering
- Use Vector Timestamps to maintain Causal Ordering
- Use elected sequencers or identifier negotiation to maintain
total ordering
- Stateful Servers
- Atomicity
- Transactions
- ACID
- Serial Equivalence
- Servers manage a resource, such as a database or a printer
- Attempt to limit problems of distributed access by making server
stateless, such that each request is independent of other
requests.
- Servers can crash in between servicing clients
- Client requests cannot interfere with each other (assuming
concurrency control in the server)
- But we can't always design stateless servers...
- Some applications better modelled as extended
conversations, eg retrieving a list of records in a
large database better modelled as getting batch of records at a
time.
- If application requires state to be consistent across a number
of machines, then each machine must recognise when it can update
internal data. Needs to keep track of state of distributed
conversation
- If conversations are long duration, then need to be aware of state.
- If other conversations need to go on - eg modify records during
retrieval, how do we allow concurrency?
- What happens if machine fails - need to recover.
- Should also aim to be fault tolerant
Stateful servers have two requirements
- Accesses from different clients shouldn't interfere
with each other
- Clients should get fast access to the server
We define atomicity as
All or Nothing: A client's operation on a server's resource
should complete successfully, and the results hold thereafter (yea,
even unto a server crash), or it should fail and the resource should
show no effect of the failed operation
Isolation: Each operation should proceed without interference
from other clients' operations - intermediate effects should not be
visible.
Mutual Exclusion: For a multi-threaded server, if two or more
threads attempt to modify the same piece of data, then the updates
should have mutual exclusion around them to provide
isolation, using semaphores or monitors
Synchronisation: In situations such as Producer-Consumer, one
operation must be allowed to finish so that a second operation can use its
results, needing isolation.

- An ATM or cashmachine allows transfer of funds between accounts.
- Accounts are held at various machines belonging to different
banks
- Accounts offer the following operations
deposit: Place an amount of money in an account
withdraw: Take an amount of money from an account
balance: Get the current value in an account
- Operations implemented as read() and write() of values, so
withdraw x from A and deposit x in B implemented as
- A.write( A.read() - x)
- B.write( B.read() + x)
Transactions are technique for grouping operations on data so that
either all complete or none complete
Typically server offers transaction service, such as:
beginTransaction(transId): Record the start of a transaction and
associate operations carrying this transId with this transaction.
commitTransaction(transId): Commit all the changes the
operations in this transaction have made to permanent storage.
abortTransaction(transId): Abort all the changes the transaction
operations have done, and roll back to the previous state.
Transactions are described by the ACID mnemonic
Atomicity: Either all or none of the transaction's operations
are performed. If a transaction is interrupted by failure, then
partial changes are undone
Consistency: The system moves from one self-consistent state to another
Isolation: An incomplete transaction never reveals partial state
or changes before committing
Durability: After committing, the system never loses the results
of the transaction, independent of any subsequent failure
2.14.4.2.1 Lost Update
Transaction T: A.withdraw(4,T); B.deposit(4,T)
Transaction U: C.withdraw(3,U); B.deposit(3,U)

T: balance = A.read()       £100
T: A.write(balance - 4)     £96
U: balance = C.read()       £300
U: C.write(balance - 3)     £297
T: balance = B.read()       £200
U: balance = B.read()       £200
U: B.write(balance + 3)     £203
T: B.write(balance + 4)     £204

U's deposit to B is lost: B finishes at £204 rather than £207.
2.14.4.2.2 Inconsistent Retrieval
Transaction T: A.withdraw(100,T); B.deposit(100,T)
Transaction U: Bank.total(U)

T: balance = A.read()         £200
T: A.write(balance - 100)     £100
U: total = A.read()           £100
U: total = total + B.read()   £300
U: total = total + C.read()   £600
T: balance = B.read()         £200
T: B.write(balance + 100)     £300

U's total of £600 misses the £100 in transit from A to B: the consistent
total is £700.
Definition: Two transactions are
serial if all the operations in
one transaction precede the operations in the other.
eg the following actions are serial
Ri(x)Wi(x)Ri(y)Rj(x)Wj(y)
Definition: Two operations are in conflict if:
- At least one is a write
- They both act on the same data
- They are issued by different transactions
Ri(x)Rj(x)Wi(x)Wj(y)Ri(y) has
Rj(x)Wi(x) in conflict
Definition: Two schedules are computationally equivalent if:
- The same operations are involved (possibly reordered)
- For every pair of operations in conflict, the same operation
appears first in each schedule
So, a schedule is serialisable if the schedule is computationally
equivalent to a serial schedule.
Question: Is the following schedule for these two transaction serially
equivalent?
Ri(x)Ri(y)Rj(y)Wj(y)Ri(x)Wj(x)Wi(y)
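Computational equivalence can be checked mechanically. A sketch, representing each operation as a (transaction, action, item) tuple; the function names are invented:

```python
def conflict_pairs(schedule):
    """Ordered pairs of conflicting operations in a schedule.

    Two operations conflict if at least one is a write, they act on the
    same data item, and they are issued by different transactions.
    """
    pairs = []
    for a in range(len(schedule)):
        for b in range(a + 1, len(schedule)):
            (t1, op1, x1), (t2, op2, x2) = schedule[a], schedule[b]
            if t1 != t2 and x1 == x2 and 'W' in (op1, op2):
                pairs.append((schedule[a], schedule[b]))
    return pairs

def computationally_equivalent(s1, s2):
    """Same operations, and every conflicting pair ordered the same way."""
    return (sorted(s1) == sorted(s2)
            and set(conflict_pairs(s1)) == set(conflict_pairs(s2)))
```

A schedule is then serialisable if it is computationally equivalent to some serial ordering of its transactions.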
Transactions may themselves be composed of multiple transactions
eg Transfer is a composition of withdraw and
deposit transactions, which are themselves composed of read and
write transactions
Benefits:
- Nested transactions can run concurrently with other transactions
at same level in hierarchy
- If lower levels abort, may not need to abort whole transaction.
Can instead use other means of recovery.
- Transactions provide technique for managing stateful servers
- Need to worry about concurrency control
- Need to worry about aspects of distribution
- Need to worry about recovery from failure
- Problem restatement
- Locking
- Optimistic control
- Timestamping
- To increase performance, multiple transactions must be able to
carry on work simultaneously...
- ...but if data is shared, then can lead to problems such as lost
updates and inconsistent retrievals.
- So we must ensure schedules of access to data for concurrent
transactions are computationally equivalent to a serial schedule of
the transactions.
- As in operating systems, locks control access for different
clients
- Granularity of data locked should be small so as to maximise
concurrency, with trade-off against complexity.
- To prevent intermediate leakage, once lock is obtained, it must
be held till transaction commits or aborts
- Conflict rules determine rules of lock usage
- If operations are not in conflict, then locks can be shared
read locks are shared
- Operations in conflict imply operations should wait on lock
write waits on read or write lock, read waits on write
lock
- Since can't predict other item usage till end of transactions,
locks must be held till transaction commits or aborts.
- If operation needs to do another operation on same data then
promotes lock if necessary and possible - operation may
conflict with existing shared lock
- When operation accesses data item within transaction
- If item isn't locked, then server locks and proceeds
- If item is held in a conflicting lock by another transaction,
transaction must wait till lock released
- If item is held by non-conflicting lock, lock is shared and
operation proceeds
- If item is already locked by same transaction, lock is
promoted if possible (refer to rule b)
- When transaction commits or aborts, locks are released
- Locks generally implemented by a lock manager
lock(transId, DataItem, LockType): Lock the specified item if
possible, else wait according to the rules above
unLock(transId): Release all locks held by the transaction
- Lock manager generally multi-threaded, requiring internal
synchronisation
- Heavyweight implementation
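A lock manager following the sharing, waiting, and promotion rules above can be sketched as follows. To stay single-threaded, this toy version raises an exception where a real manager would block the caller; the names are invented.

```python
class LockConflict(Exception):
    """Raised instead of blocking, to keep the sketch single-threaded."""

class LockManager:
    def __init__(self):
        self.locks = {}   # item -> (lock_type, set of transIds holding it)

    def lock(self, trans_id, item, lock_type):
        if item not in self.locks:
            self.locks[item] = (lock_type, {trans_id})          # unlocked: proceed
            return
        held_type, holders = self.locks[item]
        if holders == {trans_id}:
            if lock_type == 'write' and held_type == 'read':
                self.locks[item] = ('write', holders)           # promote own lock
            return
        if lock_type == 'read' and held_type == 'read':
            holders.add(trans_id)                               # share read lock
            return
        raise LockConflict(f"{trans_id} must wait for {item}")  # conflicting lock

    def unlock_all(self, trans_id):
        """Called only at commit or abort: strict two-phase locking."""
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(trans_id)
            if not holders:
                del self.locks[item]
```

Note there is no per-operation unlock: releasing only at commit or abort is what makes the locking strict.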
Transactions T and U.
- T:
RT(i), WT(j, 44)
- U:
WU(i, 55), RU(j), WU(j, 66)
Question What are the possible schedules allowed under strict
locking?
Question Are there any schedules computationally equivalent
to a serial schedule which are disallowed?
- Locks can lead to deadlock, under the following conditions
- Limited access (eg mutex or finite buffer)
- No preemption (if someone has a resource you can't take it away)
- Hold and wait: independent threads possess some of their needed resources while waiting for the remainder to become free.
- Circular chain of requests and ownership.
- Most common way of protecting against deadlock is through
timeouts. After timeout, lock becomes vulnerable and can be
broken if another transaction attempts to gain lock, leading to
aborted transactions
- Locking is overly restrictive on the degree of concurrency
- Deadlocks produce unnecessary aborts
- Lock maintenance is an overhead, that may not be required
- Most transactions do not conflict with each other
- So proceed without locks, and check on close of transaction that
there were no conflicts
- Analyse conflicts in validation process
- If conflicts could result in non-serialisable schedule, abort
one or more transactions
- else commit
Transaction has following phases
- Read phase in which clients read values and acquire tentative
versions of items they wish to update
- Validation phase in which operations are checked to see if they
are in conflict with other transactions - complex part. If invalid,
then abort.
- If validated, tentative versions are made permanent, and the
transaction can commit (or abort).
- Validation based upon conflict rules for serialisability
- Validation can be either against completed transactions or
active transactions - backward and forward validation.
- Simplify by ensuring only one transaction in validation and
write phase at one time
- Trade-off between number of comparisons, and transactions that
must be stored.
- A transaction in validation is compared against all
transactions that haven't yet committed
- Writes may affect ongoing reads
- The write set of the validating transaction is compared
against the read sets of all other active transactions.
- If the sets conflict, then either abort validating
transaction, delay validation till conflicting transaction
completes, or abort conflicting transaction.
- Writes of current transaction can't affect previous
transaction reads, so we only worry about reads with overlapping
transactions that have committed.
- If current read sets conflict with already validated
overlapping transactions write sets, then abort validating
transaction
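Both validation directions reduce to set intersections. A sketch, with invented function names; read and write sets are plain Python sets of item names:

```python
def backward_validate(read_set, committed_write_sets):
    """Backward validation: compare the validating transaction's read set
    against the write sets of overlapping transactions that have already
    committed. Any intersection forces the validating transaction to abort."""
    for write_set in committed_write_sets:
        if read_set & write_set:
            return False      # conflict: abort the validating transaction
    return True               # no conflicts: safe to commit

def forward_validate(write_set, active_read_sets):
    """Forward validation: compare the validating transaction's write set
    against the read sets of all other active transactions. Returns the
    indices of conflicting transactions; the caller may abort itself,
    delay, or abort those transactions."""
    return [i for i, rs in enumerate(active_read_sets) if write_set & rs]
```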
Operates on tentative versions of data
- Each Transaction receives global unique timestamp on initiation
- Every object, x, in the system or database carries the maximum (ie
youngest) timestamp of the last transaction to read it, RTM(x), and the
maximum timestamp of the last transaction to write it, WTM(x)
- If transaction requests operation that conflicts with younger
transaction, older transaction restarted with new timestamp.
- Transactions committed in order of timestamps, so a transaction
may have to wait for earlier transaction to commit or abort before
committing.
- Since tentative version is only written when transaction is
committed, read operations may have to wait until the last
transaction to write has committed.
An operation in transaction Ti with start time TSi
is valid if:
- The operation is a read operation and the object was last
written by an older transaction ie
TSi > WTM(x). If read permissible,
RTM(x) = MAX(TSi, RTM(x))
- The operation is a write operation and the object was
last read and written by older transactions ie
TSi > RTM(x) and
TSi > WTM(x). If permissible,
WTM(x) = TSi
- Locks are the commonest way of providing consistent concurrency
- Optimistic concurrency control and timestamping used in some
systems
- But, consistency in concurrency is application dependent - for
shared editors, people may prefer to trade speed of execution
against the possibility of conflict resolution. Problems can occur
with long-term network partition. Approaches based on notification
and resolution by people are becoming popular.
- Models for distributed transactions
- Attaining distributed commitment
- Distributed Concurrency Control

- Till now, transactions have involved multiple clients and a single
server.
- How do we have multiple clients interacting with multiple
servers? eg complicated funds transfer involving different accounts
from different banks?
- Generalise transactions to distributed case...
General characteristics of distributed systems
- Independent Failure Modes
- No global time
- Inconsistent State
Need to consider:
- how to achieve distributed commitment (or abort)
- how to achieve distributed concurrency control

- If client runs transactions, then each transaction must complete
before proceeding to next
- If transactions are nested, then transactions at same level can
run in parallel
- The client uses a single server to act as coordinator for the
transaction. The coordinator handles all communication with the
other servers
Question: What are the requirements of transaction ids?
- Distribution implies independent failure modes, ie machine can
fail at any time, and others may not discover.
- With one-phase commit, the client requests commit, but one of
the servers may have failed - no way of ensuring durability
- Instead, commit in 2 phases, thus allowing server to request abort.
- One coordinator responsible for initiating protocol.
- Other entities in protocol called participants.
- If coordinator or participant unable to commit, all parts of
transaction are aborted.
- Two phases
Phase 1: Reach a common decision
Phase 2: Implement that decision at all sites
2.16.3.1.1 2 Phase Commit Details
- Phase 1 The coordinator sends a Can Commit? message to all
participants in transaction.
- Participants reply with vote yes or no. If vote
is no participant aborts immediately.
- Phase 2 Coordinator collects votes including own:
- If all votes are yes, coordinator commits transaction
and sends DoCommit to all participants.
- Otherwise transaction is aborted, and coordinator sends abortTransaction to all participants.
- When a participant receives DoCommit, it commits its
part of the transaction and confirms using HaveCommitted
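The two phases above can be sketched as follows; a toy, single-process version with invented names that ignores timeouts and crashes, but follows the vote-then-decide structure.

```python
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "working"

    def vote(self):
        if self.can_commit:
            self.state = "prepared"     # new state: prepared to commit
            return "yes"
        self.state = "aborted"          # a no vote aborts immediately
        return "no"

    def do_commit(self):
        self.state = "committed"
        return "HaveCommitted"

    def do_abort(self):
        self.state = "aborted"

def two_phase_commit(coordinator_vote, participants):
    # Phase 1: collect votes, including the coordinator's own
    votes = [coordinator_vote] + [p.vote() for p in participants]
    # Phase 2: commit only if every vote was yes
    if all(v == "yes" for v in votes):
        for p in participants:
            p.do_commit()
        return "committed"
    for p in participants:
        if p.state == "prepared":       # undo participants that voted yes
            p.do_abort()
    return "aborted"
```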
2.16.3.1.2 2 Phase Commit Diagram
Note:
- If participant crashes after having voted to commit, it can ask
coordinator about results of vote.
- Timeouts are used when messages are expected.
- Introduces a new transaction state, prepared to commit.
- Locking is done per item, not per client.
- No problems generalising to multiple servers...
- ...except in dealing with distributed deadlock
- Same techniques as usual, but distributed deadlock detection is the
interesting part.
- Need to worry about distributed validation
- The simple model of validation had only one transaction being
validated at a time - this can lead to deadlock if different coordinating
servers attempt to validate different transactions.
- Also need to validate in the correct serialisable order.
- One solution is to globally allow only one transaction to
validate at a time.
- Another solution is to validate in two phases with timestamp
allocation - local, then global to enforce ordering.
- If clocks are approximately synchronised, then timestamps can be
<local timestamp, coordinating server id>
pairs, with an ordering
defined on server ids.
- Nested Transactions are the best model for distributed transactions
- The Two Phase Commit protocol is suitable for almost all cases
- Distributed concurrency control is only slightly more difficult
than for the single server case
- Failure Modes
- Recovery Techniques
- Partitions and quorum voting
For Transactions to be
atomic and
durable, need to examine
failures
- Transaction-local failures, detected by the application which
calls abort eg insufficient funds. No info loss, need
to undo changes made.
- Transaction-local failures, not detected by the application, but by
the system as a whole, eg divide by zero. System calls abort.
- System failures affecting transactions in progress but not media
eg CPU failure. Loss of volatile store and possibly all
transactions in progress. On recovery, special recovery manager
undoes effects of all transactions in progress at failure.
- Media failures affecting database eg head crash. No way of
protecting against this.
- We assume a machine crashes, but then is fixed and returns to operation.
- We need to recover state to ensure that the guarantees of the
transactional systems are kept.
- Use a recovery file or log that is kept on permanent storage.
- Recovery from failure handled by entity called Recovery Manager.
- Keeps information about changes to the resource in a recovery file (also called Log)
kept in stable storage - ie something that will survive failure.
- When coming back up after failure, the recovery manager looks through the
recovery file and undoes (or redoes) changes, so that uncommitted
transactions appear never to have happened, and committed transactions' effects persist.
- Events recorded on Recovery file for each change to
an object in database.
Information recorded per event include:
Transaction Id: To associate the change with a transaction
Record Id: The identifier of the object
Action type: Create/Delete/Update etc
Old Value: To enable changes to be undone
New Value: To enable changes to be redone
Also log beginTransaction, prepareToCommit, commit, and abort actions,
with their associated transaction id.
If after failure,
- the database is undamaged, undo all changes made by
transactions executing at time of failure
- the database is damaged, then restore database from archive and
redo all changes from committed transactions since archive date.
The Recovery file entry is made and committed to stable storage before
the change is made - incomplete transactions can be undone, committed
transactions redone.
What might happen if database changed before recovery file written?
Note that recovery files have information needed to undo transactions.
Calculation of which transaction to undo and redo on large logs can be
slow.
Recovery files can get too large
Instead, augment recovery file with checkpoint
- Force recovery file to stable storage
- Write checkpoint record to stable store with
- A list of currently active transactions
- for each transaction a pointer to the first record in
recovery file for that transaction
- Force database to disk
- Write address of checkpoint record to restart location atomically
To recover, have undo and redo lists. Add all active transactions at
last checkpoint to undo list
- Forwards from checkpoint to end,
- If find beginTransaction add to undo list
- If find commit record add to redo list
- If find abort record remove from undo list
- backwards from end to first record in checkpointed transactions,
execute undo for all transaction operations on undo list
- Forwards from checkpoint to end, redo operations for transactions
on redo list
At a checkpoint, the recovery file can be truncated up to the first logged
record of the checkpointed transactions.
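The forward pass above can be sketched as follows (the data structures are hypothetical): the log is a list of (action, transaction id) records from the last checkpoint onwards, and the checkpoint supplies the list of transactions active when it was taken.

```python
# Sketch of the undo/redo calculation after a crash. `log` holds records
# from the last checkpoint to the end of the recovery file.
def recovery_lists(log, active_at_checkpoint):
    undo = set(active_at_checkpoint)  # transactions active at the checkpoint
    redo = set()
    # Forward pass: classify each transaction by its final fate.
    for action, txn in log:
        if action == "beginTransaction":
            undo.add(txn)
        elif action == "commit":
            redo.add(txn)
            undo.discard(txn)  # a committed transaction is redone, not undone
        elif action == "abort":
            undo.discard(txn)
    return undo, redo

log = [("beginTransaction", 2), ("commit", 1),
       ("beginTransaction", 3), ("abort", 2)]
undo, redo = recovery_lists(log, active_at_checkpoint=[1])
# txn 1 committed -> redo; txn 2 aborted -> neither; txn 3 incomplete -> undo
```

The backward undo pass and forward redo pass would then replay the logged old and new values for the transactions on each list.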

- Coordinator uses prepared to signal starting protocol,
commit on signalling DoCommit and done to indicate end
of protocol in recovery file.
- Participant uses uncertain to indicate that it has
replied yes to the commit request, and committed when it
receives DoCommit.
- On recovery, coordinator aborts transactions which reach
prepared, and resends DoCommit when in commit state
but not done
- Participant requests the decision from the coordinator if in the
uncertain state but not yet committed.
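The recovery actions above can be summarised as a small dispatch on the last state each party logged (state names as in the notes; the functions themselves are a sketch):

```python
# On recovery, each party inspects the last state it logged for a
# transaction and acts accordingly (two-phase commit).
def coordinator_recovery_action(state):
    if state == "prepared":   # protocol started, but no decision logged
        return "abort"
    if state == "commit":     # DoCommit decided, but protocol not done
        return "resend DoCommit"
    if state == "done":       # protocol completed before the crash
        return "nothing"
    raise ValueError(state)

def participant_recovery_action(state):
    if state == "uncertain":  # voted yes, outcome unknown
        return "ask coordinator for decision"
    if state == "committed":  # DoCommit already received
        return "nothing"
    raise ValueError(state)
```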
Transactions are often used to keep replicas consistent.
If network partitions (cable breaks), replicas divided into two or more sets
(possibly with common members).

Can we still write and read from any of the sets?
Yes, but
- Must reduce possible read and write sets to maintain consistency
- Or relax consistency requirements and resolve problems when
partition is healed
Consider set of replicas, where replicated objects have version
numbers at each replica.
- Assign a weighting of votes to each replica, indicating importance.
- For a client to perform an operation, it must gather the votes of all the
replicas it can talk to (call the total X).
- X must reach the read quorum R to enable a read
- X must reach the write quorum W to enable a write.
- As long as
- W > half the total number of votes
- R + W > total number of votes in group
each read quorum and each write quorum will have at least one member
in common.
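A minimal sketch of these quorum conditions (the vote assignment is made up for illustration):

```python
# Check whether an (R, W) pair is a valid quorum configuration for a
# given assignment of votes to replicas.
def valid_quorums(votes, R, W):
    total = sum(votes)
    return W > total / 2 and R + W > total

votes = [1, 1, 1]                          # one vote per replica
assert valid_quorums(votes, R=2, W=2)      # majority read and write
assert not valid_quorums(votes, R=1, W=1)  # quorums could miss each other

# A client that has gathered X votes may proceed only if the quorum is met.
def can_read(gathered, R):
    return gathered >= R

def can_write(gathered, W):
    return gathered >= W
```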
For three replicas, R1, R2, R3, we can allocate votes to give
different properties depending upon requirements
Replica | config 1 | config 2 | config 3
R1      |    1     |    2     |    1
R2      |    0     |    1     |    1
R3      |    0     |    1     |    1
- What should R and W be set to in the three
configurations?
- What properties result from these configurations?
When a write happens, the object's version number is incremented.
To read, collect votes from replica managers together with the version
numbers of the object. A read quorum is guaranteed to contain at least one
up-to-date copy, from which the read occurs.
To write, collect votes from replica managers together with the version
numbers of the object. If a write quorum of up-to-date copies is not found,
copy the up-to-date version to other replicas to create one. Then the write
is allowed.
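A sketch of the quorum read just described, assuming one vote per replica and representing each replica as a (name, version, value) tuple:

```python
# A read gathers a read quorum and returns the value from the replica
# with the highest version number - with R + W > total votes, the quorum
# is guaranteed to contain at least one up-to-date copy.
def quorum_read(replicas, R):
    quorum = replicas[:R]  # in practice: whichever R replicas respond
    if len(quorum) < R:
        raise RuntimeError("read quorum not available")
    _, version, value = max(quorum, key=lambda r: r[1])
    return version, value

replicas = [("R1", 3, "new"), ("R2", 2, "old"), ("R3", 3, "new")]
version, value = quorum_read(replicas, R=2)
```

A quorum write would follow the same vote-gathering step, first copying the highest-versioned value to enough replicas to form the write quorum.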
Manipulating R and W gives different characteristics, eg R = 1 and W = number of copies gives unanimous update.
Cached copies of objects can be incorporated as weak
representatives with 0 votes, but usable for reads.
- Atomicity comes from using logging techniques on operations at
server, where log is kept on stable storage
- Voting can be used to give availability for resources on
partitioned replicas.