Wednesday, 1 April 2009

dist-sys-susx.ac.uk
source: http://www.cogs.susx.ac.uk/courses/dist-sys/node1.html



2.1 Introduction


Main Points

  • What we are going to cover
  • What distributed systems are
  • What characteristics well-designed systems have


2.1.1 Course Outline




What are we going to do?

  • 1 lecture on computer communications and networks and loosely
    coupled systems, revising material from the Multimedia
    Communications Technology course. The Postgraduate course will get
    an additional two hours in their seminar slot.
  • 5 lectures on Remote Procedure Call, distributed objects and
    classic distributed systems such as NFS
  • 3 lectures on integrating distributed systems through the web,
    content distribution networks and Peer to Peer computing.
  • 5 lectures on closely coupled systems for distributed
    transactions
  • 1 lecture on recent advances in networks and distributed
    systems

2.1.2 What's a Distributed System




A distributed system: physically separate computers working
together



  • Cheaper and easier to build lots of simple computers
  • Easier to add power incrementally
  • Machines may necessarily be remote, but the system should still work together
  • Higher availability - one computer crashes, others carry on
    working
  • Better reliability - store data in multiple locations
  • More security - each piece easier to secure to right level.


2.1.2.1 The real world...


In real life, can get:

  • Worse availability - every machine must be up. ``A distributed
    system is one where some machine you've never heard of fails and
    prevents you from working'' (Leslie Lamport)
  • Worse reliability
  • Worse security

Problem: Coordination is more difficult because multiple
people involved, and communication is over network.


Your Task: What are the distributed interactions when you
login at an X terminal?


2.1.2.2 Interactions at an X terminal




Simplified interactions




Image login


2.1.3 Example Distributed Systems




Electronic Mail: Mail delivered to a remote mailbox. Requires a
global name space to identify users, and transport mechanisms to get
mail to the mailbox.
Distributed Information - WWW: Remote information hidden below a
hypertext browser. Caching and other features operate transparently.
Distributed File System: Files stored on many machines,
generally not the machine you're working on. Files are accessed
transparently, the OS knowing they're remote and doing remote
operations on them such as read and write, eg the Network File System
(NFS).
Trading Floor System: Bids made, stocks sold, screens updated.

2.1.3.1 Network assumptions



  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

(Source: The Eight Fallacies of Distributed Computing -
Peter Deutsch)


2.1.4 What do we want from a Distributed System?





  1. Resource Sharing
  2. Openness
  3. Concurrency
  4. Scalability
  5. Fault Tolerance
  6. Transparency

Your Task: order the importance of each of these features for the
example systems in the previous slide.


2.1.5 Elements of a Distributed System


Another way to view a system...

  • Communications system
  • Messages
  • Machines
  • Processes on Machines
  • Programs
  • People


2.1.6 Conclusion



  • It's difficult to design a good distributed system: there are a
    lot of problems in getting ``good'' characteristics, not the least of
    which is people
  • Over the next ten weeks you will gain some insight into how to
    design a good distributed system.


2.2 Bits and Bytes


Main Points

  • Bits and Bytes
  • Kilo, Mega, Giga, Tera
  • Memory, Packets and Bit Manipulation




2.2.1 Bits, Bytes, Integers etc





  • All data in computers is held as a sequence of ones and zeros.
  • A number is represented as the base 2 number held in some
    pre-ordained fixed number of bits
  • Almost all other data is represented as various sets of
    numbers.
  • A byte is 8 bits sequenced together - what is the
    maximum number?
  • Integers (in Java and most other languages nowadays)
    are 32 bits long - what is the maximum number?
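Both questions above can be checked with a couple of lines (a sketch, assuming unsigned values; note that Java's int is signed, so its maximum is 2^31 - 1):

```python
# Maximum value representable in n bits is 2**n - 1.
def max_unsigned(bits):
    return 2 ** bits - 1

print(max_unsigned(8))    # a byte: 255
print(max_unsigned(32))   # 32-bit unsigned: 4294967295
# Java's int is signed, so one bit holds the sign:
print(2 ** 31 - 1)        # 2147483647
```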


2.2.2 Prefixes



  • In communications we often talk about throughput in
    bits/second, or moving files of some particular size. We use
    magnitude prefixes for convenience:

    kilo: 10^3 x

    mega: 10^6 x

    giga: 10^9 x

    tera: 10^12 x

  • There is often confusion as to whether a kilobyte is 1000 or
    1024 bytes. When dealing with processor architectures, it's
    generally 1024. When dealing with communications, it's
    generally 1000. State assumptions if it is not obvious from
    context.



2.2.3 Memory and Packets



  • A computer stores data as bits in memory.
  • When it wants to send this data to another computer, it copies
    the bits into the memory of the communications device.
  • The communications device sends the bits through the network
    to the other machine (we'll cover the details of this in the
    coming week).
  • The other machine's communication device places the bits
    into memory which the other machine can access.
  • The tricky bits come in ensuring that both machines
    interpret the bits correctly.

2.2.4 Bit Manipulation



  • Not only must the data be sent, but also accompanying information
    allowing the computers to interpret the context of the data.
  • Communications software must be able to pick out arbitrary bits
    from an opaque bit sequence and interpret their meaning.
  • We do this using bitwise operations - and, or, exclusive or,
    negation.
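A sketch of picking fields out of an opaque bit sequence with shifts and masks; the 16-bit header layout here is invented for illustration:

```python
# Hypothetical 16-bit header: a 4-bit version field followed by a
# 12-bit length field. Fields are extracted with shifts and AND masks.
header = 0x4ABC  # version 4, length 0xABC

version = (header >> 12) & 0xF   # shift the high bits down, mask 4 bits
length = header & 0xFFF          # mask off the low 12 bits

print(version)  # 4
print(length)   # 2748 (0xABC)
```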


2.3 Foundations of Distributed Systems


Aim: How do we send messages between computers across a network?
Physical Concepts: Bandwidth, Latency
Packet Concepts: Data, Headers
Routing Concepts: Shared Media, Switches and Routing Tables,
Routing Protocols

For more detailed notes, see Dan Chalmers' Multimedia Communications Technology notes in http://www.informatics.sussex.ac.uk/courses/mct/.

2.3.1 Physical Concepts


What is a message?

  • A piece of information which needs to move from one process to
    another, eg a request to open a file, an email message, the
    results of a remote print request.


For the network, this is just a sequence of bits in memory.


Need to communicate this sequence of bits across communications
system to other host.


Signal to other side whether each bit in sequence is
0 or 1.


To communicate, each end needs access to a common substrate; the
sender then changes the value of a physical characteristic
of the substrate.


Examples - voltage level of a piece of wire, light level in optical
fibre, frequency of radio wave in air


2.3.1.1 Signal characteristics




If physical characteristic has two levels then signal is binary.


If it has more than two levels, a number of bits can be encoded in each
level.


If 4 levels, then level 0 = 00, level 1 = 01, level 2 = 10, level 3 =
11


If 8 levels, how many bits can be encoded?
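In general, L levels encode log2(L) bits per symbol, which answers the question above:

```python
import math

# levels = 2**bits, so bits = log2(levels).
def bits_per_level(levels):
    return int(math.log2(levels))

print(bits_per_level(4))  # 2 bits, as in the 4-level example
print(bits_per_level(8))  # 3 bits
```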


2.3.1.2 Bandwidth




Adjusting the level at one end, and recognising the level has changed
at the other takes finite time.


The finite time limits the speed at which bits can be signalled.


The rate of signalling is the bandwidth, measured in bits per second.


If it takes 20 microseconds to raise the voltage on a wire, and 10
milliseconds to recognise the new level, what is the possible
bandwidth if there are two levels? Eight levels?


2.3.1.3 Noise and errors


Can we get infinite bandwidth by increasing the number of levels?


No, since noise makes it impossible to correctly distinguish level.


Noise is random changes in the physical characteristic to which all
physical phenomena are prone.


Always some probability that level will be misinterpreted.


Always some probability of error in message.


Goal of communications engineers is to make this probability as low as
necessary.


2.3.1.4 Latency


Does signal propagate along wire infinitely fast?


No, limit to speed of propagation of light. Hence described
as propagation delay.


Latency is time taken for signal to travel from one end of
communication system to destination.


Since the communication system may reconstruct the message at intermediate
points, the time taken to reconstruct the message is also part of latency.
This is known as switching delay.


Your Task: Describe the bandwidth, propagation delay and
switching delay in a game of Chinese whispers.

2.3.2 Packets


If there is an error in a message, how is it detected and rectified?



  1. Compute a checksum over the message
  2. Send the checksum with the message
  3. Calculate a new checksum over the received message
  4. Compare the checksums - if different, then message is in error
  5. Ask for the message to be resent

Probability of error rises with the length of the message.


Thus messages are sent in separate lumps with a maximum number of bits
per lump, known as packets.


If the message fits in one packet, good. Otherwise message is in many
packets.
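The five steps above can be sketched as follows; the simple wrapping sum here is only in the spirit of real Internet checksums, not any particular protocol's algorithm:

```python
# Toy detect-and-retransmit scheme: a 16-bit sum travels with the
# message; the receiver recomputes it and asks for a resend on mismatch.
def checksum(data: bytes) -> int:
    total = 0
    for b in data:
        total += b
        total = (total & 0xFFFF) + (total >> 16)  # wrap carries back in
    return total

def send(data):
    return data, checksum(data)          # step 2: checksum accompanies message

def receive(data, received_sum):
    if checksum(data) != received_sum:   # steps 3-4: recompute and compare
        return "resend"                  # step 5: ask for retransmission
    return "ok"

msg, s = send(b"hello")
print(receive(msg, s))       # ok
print(receive(b"hellp", s))  # resend (message corrupted in transit)
```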

2.3.2.1 Addressing




How do we direct the packet to correct recipient(s)?



  • Put header bits on front of packet, analogous to address and
    other information on envelope in postal system. Add source of
    packet to allow returns
  • Destination | Source | Packet body
  • If destination and source share the same physical medium, then
    destination can listen for its own address (ethernet, wireless,
    single point to point link)
  • If they don't share the same LAN, we use routing


2.3.2.2 Names, addresses and routes




Name: Identifier of object, eg Ian Wakeman, Mum
Address: Location of object, eg Rm 4C6, COGS
Route: How to get there from here - ``turn left, first right and
up stairs, first door on left''



Internet: Name is a Domain Name, such as www.cogs.susx.ac.uk. More
later.


To get packet through network, turn name into address by asking Domain
Name Service (DNS).


Place address in packet header and hand to nearest router.


Router locates MAC address corresponding to IP address, if they're on
the same LAN.

2.3.2.3 Indirect Addressing






\includegraphics{figs/indirect-addresses}

2.3.2.4 Architecture of a switch




\includegraphics{figs/switch}




A switch is a specialised computer. Can turn a PC into a router,
using free software.

  1. A packet arrives on a link
  2. The switch gets the destination of the packet from the packet
    header
  3. The switch looks up the destination in a routing table in
    memory and discovers the output link
  4. The switch queues the packet to be sent on the output link
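Steps 2-4 can be sketched as a table lookup with a default route; the table entries below are invented for illustration:

```python
# Toy switch forwarding step: look the destination up in the routing
# table and pick the output link, falling back to a default link when
# the destination is unknown.
routing_table = {"susx.ac.uk": "link1", "mit.edu": "link2"}  # made-up entries
DEFAULT_LINK = "link0"

def forward(destination):
    return routing_table.get(destination, DEFAULT_LINK)

print(forward("mit.edu"))      # link2
print(forward("unknown.org"))  # link0, the default route
```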

2.3.2.5 Distributed Routing Table Maintenance


The situation:

  • People are network administrators (and end users)
  • Communication systems are links (possibly multipoint)
  • Machines are switches
  • Messages may be lost

The problem: Given an address, how does a router know which output
link to send the packet on?


Choices:

  1. Packet could contain list of all output links - source routing.
    Requires source to lookup and determine route. May be good thing.
  2. Router could look up address in local routing table, and send out of
    corresponding link. If not in table, send out default link.



How do we construct tables? Could install entries by hand, known as
static routing.

But

  • Limited to entries people know about.
  • Unable to ensure consistency and absence of loops
  • Unable to respond to changing topology, sharks gnawing through undersea
    cables etc.
  • Internet has no centralised authority



So we use a distributed routing algorithm



2.3.2.6 Distance Vector Routing



  1. Each switch knows its own address
  2. Each link has a ``cost'', such as a value of 1 per link, or a measure
    of delay.
  3. Each switch starts out with distance vector, consisting of 0 for
    itself, and infinity for everyone else
  4. Switches exchange distance vectors with neighbour switches, and
    whenever info changes
  5. Switch saves most recent vector from neighbours
  6. Switch calculates own distance vector by examining cost from
    neighbour and adding cost of link to neighbour
  7. Use link with minimum cost to destination as link to route out.

Examples include RIP and BGP.
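Steps 5-7 can be sketched as a single recomputation at one switch; the neighbours, costs and vectors below are invented for illustration:

```python
# One distance-vector update at switch A: combine each neighbour's
# advertised vector with the cost of the link to that neighbour, and
# keep the minimum cost (and corresponding link) per destination.
INF = float("inf")

link_cost = {"B": 1, "C": 4}                     # cost of link to each neighbour
neighbour_vectors = {                            # most recent vectors received
    "B": {"B": 0, "C": 2, "D": 5},
    "C": {"B": 2, "C": 0, "D": 1},
}

def recompute(destinations):
    vector, routes = {}, {}
    for dest in destinations:
        best_cost, best_link = INF, None
        for n, cost in link_cost.items():
            total = cost + neighbour_vectors[n].get(dest, INF)
            if total < best_cost:
                best_cost, best_link = total, n  # step 7: minimum cost wins
        vector[dest], routes[dest] = best_cost, best_link
    return vector, routes

vector, routes = recompute(["B", "C", "D"])
print(vector)  # {'B': 1, 'C': 3, 'D': 5}
print(routes)  # {'B': 'B', 'C': 'B', 'D': 'C'}
```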


2.3.2.7 Link State Routing



  1. Each switch knows addresses that are direct neighbours
  2. Switch constructs packets saying who are neighbours - link
    state packets
    .
  3. Link state packets flooded to all other switches
  4. Switch constructs complete graph using most recent link state
    packets from all other switches
  5. Use Dijkstra shortest path to figure out routing table.



Examples include OSPF.
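Step 5 can be sketched with Dijkstra's shortest-path algorithm over a graph assembled from link state packets; the graph below is invented for illustration:

```python
import heapq

# Graph reconstructed from everyone's link state packets:
# node -> {neighbour: link cost}.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

def dijkstra(source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry, skip it
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd            # found a shorter path
                heapq.heappush(heap, (nd, nbr))
    return dist

print(dijkstra("A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```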

2.3.3 Conclusion: Network Properties


Packet Switched Networks present certain fundamental problems to the distributed systems programmer:

  • Until switches converge on a consistent set of routes, packets
    will bounce around in the network, and suffer delays or get dropped.
  • Switches are shared with other packets, so queues form.
  • Total latency is therefore variable.
  • Loss rate is variable (noise, available switch buffers).
  • Queue size management (aka congestion control) changes the bandwidth available to machines. Therefore bandwidth is variable.

2.8 Computer Security: Why you should never trust a
computer system




Goal: Prevent Misuse of computers

  • Definitions
  • Authentication
  • Private and Public Key Encryption
  • Access Control and Capabilities
  • Enforcement of security policies
  • Examples of Security Problems


2.8.1 Definitions




Types of Misuse

  • Accidental
  • Intentional



Protection is to prevent either accidental or intentional misuse


Security is to prevent intentional misuse


Three pieces to security

Authentication: Who the user is
Authorisation: Who is allowed to do what
Enforcement: Ensure that people only do what they are allowed to do

A loophole in any of these can cause problems eg

  1. Log in as super-user
  2. Log in as anyone, do anything
  3. Can you trust software to make decisions about 1 and 2?


2.8.2 Authentication


Common approach: Passwords. Shared secret between two parties. Since
only I know password, machine can assume it is me.


Problem 1: the system must keep a copy of the secret, to check against the
password. What if a malicious user gains access to this list of
passwords?


Encryption: a transformation on data that is difficult to
reverse - in particular, secure digest functions.

2.8.2.1 Secure Digest Functions


A secure digest function h = H(M) has the following properties:

  1. Given M, it is easy to compute h.
  2. Given h, it is hard to compute M.
  3. Given M, it is hard to compute M' ≠ M such that
    H(M) = H(M')



For example: Unix /etc/passwd file


Password
$ \rightarrow$ one way transform
$ \rightarrow$ encrypted
password


System stores only encrypted version, so ok if someone reads the file.
When you type in password, system encrypts password and compares
against the stored encrypted versions.


Over the years, password protection has evolved from DES, using a
well-known string as the input data, and the password as the key,
through to MD5 to SHA-1.
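A minimal sketch of one-way password storage using SHA-256 with a per-user random salt; the layout is illustrative, not any real /etc/passwd format:

```python
import hashlib
import os

# Store only (salt, digest); the password itself never hits disk.
def store(password: str):
    salt = os.urandom(8)                 # unique per user
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

# At login, re-derive the digest and compare against the stored one.
def check(password: str, salt: bytes, digest: str) -> bool:
    return hashlib.sha256(salt + password.encode()).hexdigest() == digest

salt, digest = store("correct horse")
print(check("correct horse", salt, digest))  # True
print(check("wrong guess", salt, digest))    # False
```

The salt is the "unique number" extension discussed below: two users with the same password get different digests, so an attacker cannot crack all accounts with one dictionary pass.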


2.8.2.2 Passwords as Human Factors Problem




Passwords must be long and obscure.


Paradox: short passwords are easy to crack, but people write long
ones down


Improving technology means we have to use longer passwords. Consider
that unix initially required only lowercase 5 letter passwords


How long for an exhaustive search?
26^5 ≈ 12,000,000

  • In 1975, 10 ms to check a password
    $ \rightarrow$ 1 day
  • In 2003, 0.00001 ms to check a password
    $ \rightarrow$ 0.1 second



Most people choose even simpler passwords such as English words - it takes
even less time to check for all words in a dictionary.


2.8.2.3 Some solutions





  1. Extend everyone's password with a unique number (stored in
    passwd file). Can't crack multiple passwords at a time.
  2. Require more complex passwords, eg 7 letters with lower, upper,
    number and special: 70^7 ≈ 8000 billion, or 1 day. But
    people pick common patterns eg 6 lower case plus number.
  3. Make it take a long time to check each password. For example,
    delay every login attempt by 1 second.
  4. Assign long passwords. Give everyone a calculator or smart card
    to carry around to remember password, with PIN to activate. Need
    physical theft to steal card.



Problem 3: Can you trust the encryption algorithm? Recent
example: techniques thought to be safe such as DES, have back doors
(intentional or unintentional). If there is a back door, you don't
need to do complete exhaustive search.


2.8.3 Authentication in distributed systems: Private Key Encryption




In distributed systems, the network between the machine on which the
password is typed and the machine the password is authenticating on is
accessible to everyone.


Two roles for encryption:

  1. Authentication - do we share the same secret?
  2. Secrecy - I don't want anyone to know this data (eg medical
    records)



Use an encryption algorithm that can easily be reversed given the
correct key, and difficult to reverse without the key


\includegraphics{figs/private}



  • From cipher text, can't decode without password.
  • From plain text and cipher text, can't derive password.
  • As long as the password stays secret, we get both secrecy and
    authentication.

2.8.3.1 Symmetric Encryption


Symmetric encryption uses exclusive ors, addition, multiplication,
shifts and transpositions, all of which are fast on modern processors.
DES: The Data Encryption Standard (DES) was released in 1977.
The 56 bit key is too weak for most uses now, and instead 3DES or
triple DES is used, which has a 168 bit key.
IDEA: The International Data Encryption Algorithm has 128 bit
keys, and has proven strong against a large body of analysis.
AES: The US-based NIST has defined a new encryption algorithm
based on Rijndael, which offers 128, 192 or 256 bit keys.


2.8.3.2 Authentication Server - Kerberos Operation




The server keeps a list of passwords, provides a way for two parties,
A, B to talk to each other, as long as they trust the server.


Notation: Kxy is a key for talking between x and y. (..)^K means
encrypt message (..) with key K.

Simplistic overview:

  • A asks server for key:

    A -> S (Hi! I'd like a key for A,B)

  • Server gives back special ``session'' key encrypted in A's key, along
    with ticket to give to B:

    S -> A (Use Kab (This is A! Use Kab)^Ksb)^Ksa

  • A gives B the ticket:

    A -> B (This is A! Use Kab)^Ksb


Details: add in timestamps to limit how long each key exists, use
single use authenticators so that clients are sure of server
identities, and to prevent machine replaying messages later.


Also have to include encrypted checksums to prevent malicious user
inserting garbage into message.


2.8.4 Public Key Encryption




Public key encryption is a much slower alternative to private key;
separates authentication from secrecy.


Each key is a pair K, K^-1


With private key: (text)^K^K = text


With public key:

  • (text)^K^K^-1 = text,
    but (text)^K^K ≠ text

  • (text)^K^-1^K = text,
    but (text)^K^-1^K^-1 ≠ text


Can't derive K from K^-1, or vice versa
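The key-pair property can be demonstrated with a toy RSA pair (the standard textbook example n = 3233 = 61 x 53, e = 17, d = 413; real keys are thousands of bits, so these values are purely illustrative):

```python
# Applying K then K^-1 (or K^-1 then K) restores the text; applying
# the same key twice does not.
n, e, d = 3233, 17, 413   # public exponent e, private exponent d

def apply_key(m, exp):
    return pow(m, exp, n)  # modular exponentiation

text = 65
signed = apply_key(text, d)          # apply private key K
print(apply_key(signed, e) == text)  # True: K^-1 undoes K (authentication)
secret = apply_key(text, e)          # apply public key K^-1
print(apply_key(secret, d) == text)  # True: K undoes K^-1 (secrecy)
```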

2.8.4.1 Public key directory




Idea: K is kept secret, K^-1 made public, such as in a public directory


For example: (I'm Ian)^K. Everyone can read it, but only I
could have sent it (authentication).


(Hi!)^K^-1. Anyone can send it but only I can read it
(secrecy).


((I'm Ian)^K Hi!)^K'^-1


On first glance, only I can send it, only you can read it.
What's wrong with this assumption?


Problem: How do you trust the dictionary of public keys? Maybe
somebody lied to you in giving you a key.

2.8.5 Secure Socket Layer





  • Provides a technique for data sent over a TCP connection to be
    encrypted.
  • Uses public key technology to agree upon a key, then 3DES or
    whatever to encrypt the session.
  • Data encrypted in blocks, optionally with compression first.
  • Used in http as https, and for telnet, ftp etc.

2.8.5.1 SSL handshake protocol




\includegraphics{figs/ssl}


2.8.6 Authorisation




Who can do what...


Access control matrix: formalisation of all the permissions in the
system



objects:   file1   file2   file3   ...
users:
  A        rw      r
  B                rw
  C                        r
  ...



For example, one box represents that C can read file3


Potentially huge number of users and objects, so impractical to store
all of these.


2.8.6.1 Approaches to authorisation





  1. Access control list - store all permissions for all users with
    each object


    Still might be a large number of users. Unix addresses this by having rwx
    for user, group and world. Recent systems provide a way of specifying
    groups of users and permissions for each group

  2. Capability list - each process stores tickets for the objects it
    can operate on.
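A minimal sketch of option 1, storing the sparse access control matrix as per-object access control lists; the users, files and permissions follow the earlier matrix example:

```python
# Only the non-empty boxes of the matrix are kept, one list per object.
acl = {
    "file1": {"A": {"r", "w"}},
    "file2": {"A": {"r"}, "B": {"r", "w"}},
    "file3": {"C": {"r"}},
}

def allowed(user, obj, op):
    return op in acl.get(obj, {}).get(user, set())

print(allowed("C", "file3", "r"))  # True: C can read file3
print(allowed("C", "file1", "w"))  # False: that box is empty
```

A capability list would invert this layout: each process carries its own set of (object, permission) tickets instead.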


2.8.6.2 Digital Rights Management





  • Digital content is produced by people earning a living, and they
    wish to protect their investment
  • Digital Rights Management is the catchall term for the
    technologies controlling the use of digital content.
  • Two main approaches:
    Containment: the content is wrapped in an encrypted
    shell, so that users have to prove their capability of
    using the content through knowledge of the key.

    Watermarking: the content is marked so that devices know
    that the data is protected.


  • The problem is how to enforce the checking of
    capabilities and enforce the no circumvention requirement.


2.8.7 Enforcement




Enforcer is the program that checks passwords, access control lists,
etc


Any bug in the enforcer means a way for a malicious user to gain the
ability to do anything


In Unix, superuser has all the powers of the Unix kernel - can do
anything. Because of coarse-grained access control, lots of stuff has
to run as superuser to work. If bug in any program, you're in
trouble.


Paradox:

  1. make enforcer as small as possible - easier to get correct, but
    simple minded protection model (more programs have high privilege).
  2. fancy protection - only minimal privilege, but hard to get
    right.

2.8.8 Trusted Computing Platform




Question: How do we ensure no software transgresses digital
rights?


Answer: By only allowing approved software to have access to
the data.



  • The basic computer and operating system must be running
    validated software.
  • There must be an untamperable part of the computer/OS that can
    hold keys and hand over only to validated software.
  • The untamperable part of the computer is the Fritz chip, a
    co-processor which holds a unique certificate that it is running
    some validated code

2.9 Names and Naming Services


2.9.1 Main Points





  • Use of names
  • Structure of names
  • Name Services
  • Domain Name Service (DNS) - The Internet Name Service



Definitions:

Name - what it's called
Address - where it is
Route - how to get there


2.9.2 Why names?




Object Identification: a service or resource we want to use, eg
a filename, a telecommunications provider, a person
Allow Sharing: Communicating processes can pass names and thus
share resources
Location Independence: If we separate the name from the address,
we can migrate the object transparently
Security: If there is a large number of possible names, knowing the name of
an object means that it must explicitly have been passed. If the
entire system is constructed so that names are passed with
authorisation, then knowing a name implies a chain of trust allowing
access to the object.


2.9.3 What does one do with names?



  • Use as arguments to functions, eg to call a service
  • Create names. If an object comes into existence it must be
    named. The name should meet the rules of the system, eg be unique; therefore a
    naming authority (possibly the object itself) must keep track of what names it
    has allocated
  • Delete names. When an object disappears from the system, at
    some point we may wish to delete its name and allow re-use.


2.9.4 What's a name?





  • Part of the design space for any distributed system is how to
    construct the name.
  • Choice of design has ramifications for design of rest of system
    and performance of service.


2.9.4.1 Unique Identifier





  • Unique Identifier (UID) aka flat names, primitive names.
  • No internal structure, just string of bits.
  • Only use is comparison against other UIDs eg for lookup in table
    containing information on named object.
  • Provide location independence and uniformity
  • But Difficult to name different versions of the same
    object eg Distributed systems edition 1 vs Distributed Systems
    edition 2. Real objects have relationships to each other - useful
    to reflect in naming practice
  • Difficult to discover address from name - need to search entire
    name space


2.9.5 Partitioned names


Add partition structure to name to enable some function to be
more efficient, typically location and name allocation


\includegraphics{figs/dns}



  • Domain Name Service (DNS) name - unix.rn.informatics.scitech.sussex.ac.uk.
    Each part of name comes from flat name space. Division of name space
    and possible objects into smaller space. "uk" reduces objects to
    those within the uk, "ac" to those administered by academic
    networks, and so on.



  • When allocating names, simply add to lowest part of partition.
    Low risk of collision, since small number of other objects, all
    administered by same authority



  • When looking up name, can guess where information will reside.


2.9.6 Descriptive names



  • Necessary to have a unique name.



  • But useful to name objects in different ways, such as by service
    they offer eg Postman Pat, John of Gwent, www.informatics.sussex.ac.uk.



  • Create name using attributes of object.



  • Note that objects can therefore have multiple names, not all of
    which are unique



  • Choice of name structure depends on system. DNS chooses
    partition according to administration of creation, with aliases to
    allow naming by service eg ftp.informatics.sussex.ac.uk is also
    doc-sun.rn.informatics.scitech.sussex.ac.uk


2.9.7 Object Location from name - broadcast



  • Ask all possible objects within the system if they respond to
    that name. Can be efficient if network supports broadcast eg
    Ethernet and other LANs.



  • Equivalent to distributing name table across all objects,
    objects storing names referring to them.



  • Only want positive responses - all responses would generate a
    lot of traffic.



  • Scaling problem when move into wide area.

    • Greater number of hosts imply greater probability of failure



    • Broadcasts consume higher proportion of bandwidth, made worse
      due to higher failures needing more location requests




  • Broadcast used only on small LAN based systems (and in initial
    location of directory)


2.9.8 Location through database



  • Keep information about object in a table, indexed by a name. Location
    is just another piece of information.



  • In DNS, table is stored as {name,attribute} pairing, in a resource record



  • If database centralised, then

    • Whole system fails if database machine fails
    • Database machine acts as bottleneck for performance of system as
      whole
    • In wide area systems, authority should be shared amongst
      controlling organisations



    So name information usually distributed.


2.9.9 Distributed Name Servers


Parts of the name table are distributed to different servers eg. in
Domain Name Service, servers are allocated portions of the name space
below certain domains such as the root, ac.uk, susx.ac.uk


Names can be partitioned between servers based on

algorithmic clustering: eg apply a well-known hash function to the name to map it to a server. May
result in the server being remote from the object. The only technique available for UIDs
structural clustering: if the name is structured, use the structure to assign names to a particular server, such as in DNS
attribute clustering: if names are constructed using attributes, then servers take responsibility for certain attributes.
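A minimal sketch of algorithmic clustering, hashing the name to pick the responsible server; the server list is invented:

```python
import hashlib

# Every client computes the same hash, so everyone agrees on which
# server holds a given name without any lookup traffic.
servers = ["ns0", "ns1", "ns2", "ns3"]

def server_for(name: str) -> str:
    digest = hashlib.sha256(name.encode()).digest()
    return servers[digest[0] % len(servers)]  # deterministic choice

print(server_for("www.cogs.susx.ac.uk"))
```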


2.9.10 Availability and performance


If a single server is responsible for name space, there is still single
point of failure, and a performance bottleneck.


Most systems therefore

  • Replicate name list to other servers. Also increases
    performance for heavily accessed parts of name space eg secondary
    servers in DNS
  • Cache information received by lookup. No need to repeat lookup
    if asked for same information again. Increases performance.
    Implemented in both client side and server (in recursive calls) in
    DNS.
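A minimal sketch of client-side caching of lookup results with a time-to-live, so stale entries eventually expire; the name, address and TTL value are illustrative:

```python
import time

TTL = 60.0                      # seconds before a cached entry expires
cache = {}                      # name -> (address, time stored)

def lookup(name, resolve):
    entry = cache.get(name)
    if entry and time.time() - entry[1] < TTL:
        return entry[0]                  # fresh cached answer, no network
    address = resolve(name)              # otherwise ask the name server
    cache[name] = (address, time.time())
    return address

calls = []
def resolve(name):                       # stand-in for a real DNS query
    calls.append(name)
    return "139.184.0.1"                 # made-up address

lookup("www.susx.ac.uk", resolve)
lookup("www.susx.ac.uk", resolve)        # second call served from cache
print(len(calls))                        # 1: only one real lookup happened
```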



If information is cached, how do we know when it's invalid? We may
attempt to use inconsistent information.


2.9.11 Maintaining consistency for distributed name services


Alleviated by following features of some distributed systems

  • In most systems objects change slowly, so names live for a long
    time, and are created infrequently
  • If the address of an object is wrong, it causes an error. The user of the
    address can recover if it assumes that inconsistent information is one of
    the possible problems.
  • Obsolete information can be fixed by the addressed object leaving
    redirection pointers. Equivalent to leaving a forwarding address for a new home.



However, there are always systems which break these assumptions eg
highly dynamic distributed object system, creating lots of objects and
names and deleting lots of names.


2.9.12 Client and Name server interaction.



  • Hidden behind RPC interface.



  • Client knows of server to ask (either installed in file, or
    through broadcast location).



  • Client calls with arguments of name and required attribute, eg
    address. In DNS arguments are name and the type of requested
    attribute.



  • Server will return result with either required attribute or
    error message


2.9.12.1 Lookup Modes




If the name is not stored on this server, it may be stored on another server. Two options:

Recursive: the server asks another possible server about the name and
attribute, which may then have to ask another server and so on


\includegraphics{figs/recursive}



Iterative: the server returns the address of another possible server to the client, who then resends the request to the new server.


\includegraphics{figs/iterative}


2.9.13 Summary



  • Names can be flat, or they can be structured



  • Centralised name servers suffer from availability - distributed
    name servers suffer from inconsistency



  • Interaction with name server best modelled by RPC

2.10 Distributed File Systems


2.10.1 Main Points


A Distributed File System provides transparent access to files
stored on a remote disk


Themes:

  • Failures - what happens when server crashes but client doesn't?
    or vice versa
  • Performance
    $ \Rightarrow$ Caching - use caching at both the
    clients and the server to improve performance
  • Cache Coherence - how do we make sure each client sees most up
    to date copy of file?


2.10.2 Client Implementation



  • Request for operation on file goes to OS.
  • OS recognises file is remote and constructs RPC to remote file server
  • Server receives call, does operation and returns results.
  • Subtrees in the directory structure generally map to remote file systems.
    Provides access transparency


    \includegraphics{figs/remote-dir}


2.10.3 No Caching



  • Simple approach: Use RPC to forward each file system request to remote
    server (older versions of Novell Netware).
  • Example operations: open, seek, read, write, close
  • Server implements operations as it would for local request and passes
    result back to client


    Image simple-fs


2.10.3.1 Advantages and Disadvantages of uncached remote file
service


Advantage: the server provides a consistent view of the file system to both A
and B.
Disadvantage: can be lousy performance

  • Going over network is slower than going through memory
  • Lots of network traffic - congestion
  • Server can be a bottleneck - what if lots of clients


2.10.4 NFS - Sun Network File System


Main idea - uses caching to reduce network load

  • Cache file blocks, file headers etc in both the client's and the
    server's memory.
  • More recent NFS implementations use a disk cache at the client
    as well.



\includegraphics{figs/nfs-fs}

Advantage: if open/read/write/close can be done locally, there is no network
traffic

Issues: failure and cache consistency


2.10.4.1 Failures


What if server crashes? Can client wait until server comes back up
and continue as before?

  1. Any data in server memory but not yet on disk can be lost
  2. If there is shared state across RPCs, eg open, seek, read: what
    if the server crashes after the seek? Then when the client does
    ``read'' it will fail.
  3. Message re-tries - suppose server crashes after it does ``rm
    foo'' but before it sends acknowledgement? Message system will
    retry and send it again. How does system know not to delete it
    again (someone else may have created new ``foo''). Could use
    transactions, but NFS takes more ad hoc approach.

What if client crashes?

  1. Might lose modified data in client cache.


2.10.4.2 NFS Protocol



  1. Write-through caching - when a file is modified, all modified
    blocks are sent immediately to the server disk. To the client,
    ``write'' doesn't return until all bytes are stored on disk.
  2. Stateless Protocol - server keeps no state about client, except
    as hints to improve performance

    • Each read request gives enough information to do entire operation -
      ReadAt(inumber, position) not Read(openFile).
    • When server crashes and restarts, can start processing requests
      immediately, as if nothing had happened.


  3. Operations are idempotent: all requests are ok to repeat
    (all requests are done at least once). So if server crashes between
    disk I/O and message send, client can resend message and server
    does request over again

    • Read and write file blocks are easy - just re-read or re-write file
      block, so no side-effects.
    • What about ``remove''? NFS just ignores the problem - it does the
      remove twice, returning an error if the file is not found; the
      application breaks if the retry inadvertently removes another file.


  4. Failures are transparent to client system.


    Is this a good idea? What should happen if the server crashes? If an
    application is in the middle of reading a file when the server
    crashes, the options are:

    1. Hang until server returns (next week...)
    2. Return an error? But the networked file service is transparent -
      the application doesn't know the network is there. Many Unix apps
      ignore errors and crash if there is a problem.

    NFS has both options - the administrator can select which one.
    Usually hang and return error if absolutely necessary.
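The stateless, idempotent style of request in points 1-3 can be sketched as below. The file table, inumber and function names are invented for illustration; the point is that every request carries enough information (file id plus offset) to stand alone, so a retried request is harmless.

```python
# Toy "server-side" file table: inumber -> file contents.
FILES = {42: b'hello distributed world'}

def read_at(inumber, position, count):
    """Stateless read in the NFS style: no open-file table on the
    server, so repeating the request has no side effects."""
    data = FILES[inumber]
    return data[position:position + count]

# Retrying the same request returns the same bytes - idempotent.
assert read_at(42, 6, 11) == read_at(42, 6, 11) == b'distributed'
```

Compare `Read(openFile)`, which would need the server to remember a per-client file position - exactly the shared state that is lost in a crash.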


2.10.4.3 Cache consistency



  • What if multiple clients sharing same files? Easy if they are both
    reading - each gets a local copy in their cache.
  • What if one writing? How do updates happen?
  • Note NFS has write-through cache policy. If one client modifies file,
    writes through to server.
  • How does other client find out about change?


2.10.4.4 NFS and weak consistency



  • In NFS, the client polls the server periodically to check whether a
    file has changed. It polls if the cached data hasn't been checked in
    the last 3-30 seconds (the exact time is a tunable parameter)


    \includegraphics{figs/nfs-cache}



  • When file changed on one client, server is notified, but other clients
    use old version of file till timeout. They then check server and get
    new version.
  • What if multiple clients write to the same file? In NFS you get
    either version, or a mixed version. Completely arbitrary!
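The polling scheme above can be sketched as follows. The `Server` and `CachingClient` classes are invented for illustration; a cached copy is trusted until the timeout expires, so reads can be stale for up to that long.

```python
import time

class Server:
    """Toy file server: one file, with a modification time."""
    def __init__(self):
        self.data, self.mtime = 'v1', 0.0
    def write(self, data):
        self.data, self.mtime = data, time.time()

class CachingClient:
    """Client that re-checks the server only after `timeout` seconds."""
    def __init__(self, server, timeout=3.0):
        self.server, self.timeout = server, timeout
        self.cached = self.cached_mtime = self.checked = None
    def read(self):
        now = time.time()
        if self.cached is None or now - self.checked > self.timeout:
            # Poll: refetch only if the server's copy is newer than ours.
            if self.cached_mtime != self.server.mtime:
                self.cached = self.server.data
                self.cached_mtime = self.server.mtime
            self.checked = now
        return self.cached   # may be stale for up to `timeout` seconds

srv = Server()
client = CachingClient(srv)
print(client.read())   # 'v1'
```

If another client now writes `'v2'` to the server, this client keeps returning `'v1'` until its timeout expires - exactly the window of inconsistency described above.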


2.10.4.5 Sequential ordering constraints



  • What should happen if one CPU changes a file and, before it
    completes, another CPU reads the file?
  • We want isolation between operations, so a read
    should get the old file if it completes before the write starts, and the
    new version if it starts after the write completes. Either all new or
    all old - nothing in between (cf serialisability).


2.10.4.6 NFS Pros and Cons





  • it's simple
  • it's highly portable
  • but it's sometimes inconsistent
  • and it doesn't scale to large numbers of clients



Note that this describes NFS version 3.


2.10.5 Andrew File System


AFS (CMU late 80s)
$ \Rightarrow$ DCE DFS (commercial product)

  1. Callbacks: Server records who has copies of file
  2. Write through on close

    • If file changes, server is updated (on close)
    • Server then immediately tells all those with old copy



2.10.5.1 AFS Session Semantics


Session semantics - updates only visible on close

  • In Unix (single machine), updates visible immediately to other
    programs who have file open.
  • In AFS, everyone who has the file open sees the old version; anyone
    who opens the file again will see the new version.


    In AFS:

    1. on open and cache miss: get file from server, set up callback
    2. on write close: send copy to server; tells all clients with
      copies to fetch new version from server on next open

  • Files cached on local disk; NFS (used to) cache only in memory



\includegraphics{figs/AFS-cache}

  • What if server crashes? Lose all your callback state.
  • Reconstruct information from client - go ask everyone ``who has which
    files cached''
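The callback scheme can be sketched as below. The `AFSServer`/`AFSClient` classes and their method names are invented for illustration; the essential state is the server's record of who holds a cached copy of each file.

```python
class AFSServer:
    def __init__(self):
        self.files = {}       # name -> contents
        self.callbacks = {}   # name -> set of clients with a cached copy

    def fetch(self, name, client):
        # Cache miss at the client: record a callback promise for it.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store_on_close(self, name, data, writer):
        self.files[name] = data
        # Break callbacks: tell everyone else their copy is stale.
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)
        self.callbacks[name] = {writer}

class AFSClient:
    def __init__(self, server):
        self.server, self.cache = server, {}
    def open(self, name):
        if name not in self.cache:            # miss: fetch + set callback
            self.cache[name] = self.server.fetch(name, self)
        return self.cache[name]
    def write_close(self, name, data):
        self.cache[name] = data
        self.server.store_on_close(name, data, self)
    def break_callback(self, name):
        self.cache.pop(name, None)            # next open refetches
```

A client whose callback has been broken simply refetches on its next open, giving the session semantics described above.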


2.10.5.2 AFS Pros and Cons





  • Disk as cache
    $ \Rightarrow$ more files cached locally
  • Callbacks
    $ \Rightarrow$ server not involved if file is
    read-only (Majority of file access is read-only)
  • But on fast LANs, local disk is slower than remote memory



NFS version 4 will provide session semantics when it is
deployed by vendors in the 2005 timeframe.


2.10.6 Summary





  • Remote file systems need caching to get decent performance.



  • Central server is a bottleneck

    • Performance bottleneck:

      • All data is written through to server
      • all cache misses go to server


    • Availability Bottleneck

      • Server is single point of failure


    • Cost bottleneck

      • Server machines high cost relative to workstation



2.12 Content Distribution Networks


Main Points

  • Building content caches
  • Pre-fetching data
  • Using your neighbours - BitTorrent

2.12.1 Getting Content over a Network




Image cdn-base



  • Users want to download content from servers as quickly as possible
  • What structures can we use to improve their experience, and the
    impact upon the network?

2.12.2 Web Caches


Image cdn-cache

  • Large organisations can improve web performance by sticking a
    web cache in front of HTTP connections
  • The cache inspects an incoming HTTP request to see if it can
    satisfy the request from locally cached objects.
  • If yes, then the object is returned, and the last reference time
    is updated
  • If no, then the object is retrieved and copied locally if allowed.
  • Web objects can be marked as non-cacheable.
  • Caching is a major part of the HTTP standard
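The cache's decision procedure can be sketched as below. The origin is modelled as a dict, and all names (`ORIGIN`, `get`, the URLs, the cacheable flag) are invented for illustration; real caches follow the HTTP caching rules.

```python
import time

# Origin content: url -> (body, cacheable?). '/private' is marked
# non-cacheable, as a site tracking its readers might do.
ORIGIN = {'/index.html': ('<html>home</html>', True),
          '/private': ('secret', False)}

cache = {}   # url -> (body, last_reference_time)

def get(url):
    if url in cache:
        body, _ = cache[url]
        cache[url] = (body, time.time())   # hit: update last reference
        return body, 'HIT'
    body, cacheable = ORIGIN[url]          # miss: fetch from the origin
    if cacheable:
        cache[url] = (body, time.time())   # keep a local copy if allowed
    return body, 'MISS'
```

The last-reference time would feed an eviction policy such as LRU, not shown here.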

2.12.2.1 Cache Performance




  • The performance of a web cache is difficult to model, since the
    performance is a complex mixture of interaction between TCP, HTTP
    and content.
  • Caches work because of temporal locality, due to popularity of
    content, and spatial locality, due to structure of HTML documents
  • Measurements of web proxies give the following results (based on
    JANET caches)

    • Request hit rate is about 60%.
    • Volume hit rate is about 30%.
    • Latency improvement is around a factor of 3 on average



2.12.2.2 Problems with Caching



  • Not all content is marked as cacheable, eg because the site
    wants accurate records of who looks at content.
  • All hosts behind a cache appear to come from one
    address.

    Question: Why is this a problem?

2.12.3 Pre-fetching Data



  • Can we improve performance by pro-actively distributing content
    to caches?
  • Yes...

2.12.3.1 Active Content Distribution



  • The HTML uses links to the CDN's domain name, eg akamai.com
  • The internet service provider has entries in their local DNS for
    akamai.com pointing to a machine on the ISP's network.
  • Content will therefore be supplied from the local machine rather
    than the original machine
  • Customer interaction improved.
  • Bandwidth requirements of servers reduced

2.12.4 Using your Peers: BitTorrent


Image cdn-torrent

  • Why not use the other people receiving the content?
  • BitTorrent downloads from other machines
  • Basic Idea:

    • To download, the host contacts a machine tracking those already
      downloading the torrent.
    • Peers are selected at random from the tracker.
    • Pieces are selected to download from those available on the
      downloader's peer set until all pieces of the file have been received

2.12.4.1 The .torrent file



  • To start downloading from a torrent, the consumer must
    first locate a .torrent file.
  • This contains information about the file length, name and
    hashing numbers of the file blocks, and the url of a tracker
  • The file is split into 256 KByte pieces, each having a SHA1 hash
    calculated.
  • A tracker holds the IP addresses of current downloaders

2.12.4.2 Locating peers



  • After receiving the torrent file, the downloader contacts the
    tracker
  • The tracker inserts the downloader into its list of downloaders,
    and returns a random list of other downloaders
  • This list becomes the downloader's peers
  • Question: What is the shape of the overlay graph?

2.12.4.3 Choosing Pieces



  • The downloader will contact its peers to discover what pieces
    they have.
  • It then chooses a piece to download:
  • The first choice is made randomly, so as to spread load
  • Subsequent pieces are based on a rarest-first approach to
    increase probability all pieces are available.
  • When a peer has downloaded a new piece which matches the SHA1
    hash, it notifies its peers it has a new piece.
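The piece-selection policy can be sketched as follows. The data shapes (a set of held pieces, one set of pieces per peer) are invented for illustration.

```python
import random
from collections import Counter

def choose_piece(have, peer_pieces, first=False):
    """have: pieces we already hold; peer_pieces: one set per peer."""
    wanted = [p for pieces in peer_pieces for p in pieces
              if p not in have]
    if not wanted:
        return None
    if first:
        # First piece: picked at random, to spread load.
        return random.choice(sorted(set(wanted)))
    # Count replication within the peer set and pick the rarest piece,
    # to increase the probability that every piece stays available.
    counts = Counter(wanted)
    return min(counts, key=lambda p: counts[p])

# Piece 2 is held by only one peer, so it is chosen before 0 and 1.
peers = [{0, 1}, {0, 1, 2}, {0, 1}]
print(choose_piece(set(), peers))   # 2
```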

2.12.4.4 Choosing Downloaders



  • A request to upload is accepted if the requester recently
    uploaded to it
  • This provides an incentive to machines to share data
  • Periodically other machines are tried for upload

2.12.5 Summary



  • Web caching improves performance by a reasonable factor,
    dependent on situation
  • Pro-active content distribution can reduce latency and improve
    bandwidth usage for popular services
  • BitTorrent can improve bandwidth usage by spreading load across
    peers.

2.13 Replication: Availability and Consistency



  • Motivation for replication
  • Multicasting updates to a group of replicas
  • Total Ordering
  • Causal Ordering
  • Techniques for ordering protocols
  • ISIS CBCAST


2.13.1 What is Replication?




  • Multiple copies of dynamic state stored on multiple machines
    eg Copies of files stored on different machines, name servers storing
    name address mappings

  • Caching can be seen as a form of replication.


2.13.1.1 Why is Replication used?


Performance enhancement



  • Single Server acts as a bottleneck - if we can balance load
    amongst multiple servers, get apparent performance gain
  • If clients are geographically distributed, we can site servers
    near clients and reduce communication costs


Availability



  • If a machine fails, then we can still provide a service
  • Probability of total failure, such as all data being lost, is
    reduced, since data is replicated across multiple machines
  • If the probability of failure is pr(fail ) for a given machine, then
    with n machines the probability of loss of service is
    pr(fail )^n and the availability of the service is
    1 - pr(fail )^n
  • eg, if the mean time between failures for 3 machines is 5 days and
    the repair time is four hours, then assuming independence of failures,

    pr(fail )= $ {\frac{{4}}{{5 \times 24}}}$ = 0.033.


    Availability =
    1 - 0.033^3 = 99.996%
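The figures above can be checked directly, assuming independent failures: each machine is down for its repair time out of every mean-time-between-failures period.

```python
repair_hours = 4
mtbf_hours = 5 * 24   # 5 days
n = 3                 # number of replicas

p_fail = repair_hours / mtbf_hours   # ~0.033 per machine
availability = 1 - p_fail ** n       # service lost only if all n are down
print(f'{availability:.3%}')         # 99.996%
```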





Fault tolerance: even in the presence of failure, the service
will continue to give the correct service

  • Stronger than availability, since can provide real-time
    guarantees (with extra work!)
  • Can protect against arbitrary failure where machines feed
    wrong information (Byzantine Failure)


2.13.2 Issues in Replication


A collection of replicas should behave as if state was stored
at one single site

  • When accessed by client, view should be consistent
  • Replication should be transparent - client unaware that servers
    are replicated



If we are providing a replica service, replica can be passive or
active.



Passive replicas are standbys, to maintain service on failure. No
performance improvement.


Standbys must monitor and copy state of active server


Provide availability in simple manner.


Used for highly available systems eg space applications


2.13.3 Consistency





  • Clients can modify resource on any of the replicas.
  • What happens if another client requests resource before replica
    has informed others of modification, as in cache consistency in
    distributed file systems?
  • Answer depends upon application...


2.13.3.1 Example Distributed Bulletin Board System (BBS)


\includegraphics{figs/bbs}

  • Users read and submit articles through Front End.
  • Articles replicated across a number of servers
  • Front Ends can connect to any server
  • Servers propagate articles between themselves so that all
    servers hold copies of all articles.
  • User membership of a given bbs is tightly controlled.



Questions on BBS:

  • How should messages be passed between replicas?
  • Should order of presentation of articles to clients be the same
    across all replicas? Are weaker ordering semantics possible?
  • When a client leaves a bbs group, can they see articles submitted
    after they have left? Is this desirable?
  • What should happen when replicas are temporarily partitioned?


2.13.4 Updating Server state




Clients read and update state at any of the replicated servers eg
submit messages in bbs. To
maintain consistency, three things are important

Multicast communication: messages are delivered to all servers in
the group replicating the data
Ordering of messages: updates occur in the same ``order'' at
each server
Failure recovery: when servers or the network fail and come
back, the replicas must be able to regain consistency. Done through
voting and transactions (later in course)


2.13.5 Multicast and Process Groups




A Process Group: a collection of processes that co-operate
towards a common goal.


Multicast communication: One message is sent to the members of a
process group


Idea: Instead of knowing address of process, just need to know an
address representing the service. Lower levels take care of routing
messages.


Useful for:

Replicated services: one update message goes to all replicas,
which perform identical operations. Reduces communication costs.
Locating objects in distributed services: a request for an object
goes to all processes implementing the service, but only the process
holding the object replies.


2.13.5.1 Group Services




Maintenance of group information is a complex function of the name
service (for tightly managed groups)

Create group: create a group identifier that is globally
unique.
Join group: join a group. Requires the joining process's information
to be disseminated to the message routing function. May require
authentication and notification of existing members.
Leave group: remove a process from a group. May require
authentication, and may occur as a result of failure or partition. Need
to notify the message routing function, and may notify other members.
Member list: supply the list of processes within a group.
Needed for reliable message delivery; may require authentication.


2.13.6 Message Ordering




If two processes multicast to a group, the messages may be arbitrarily
ordered at any member of the group.


Process P1 multicasts message a to a group comprising processes P1, P2, P3 and
P4.


Process P2 multicasts message b to the same group


The order of arrival of a and b at members of the group can be different.


\includegraphics{figs/order}


2.13.6.1 Ordering example




\includegraphics{figs/causal-object}

  • Order of operations may be important - delete object, create
    object.
  • If delete object arrives before create object, then operation
    not completed


2.13.6.2 Ordering Definitions




Various definitions of order with increasing complexity in
multicasting protocol

FIFO ordering: messages from one process are processed at all
group members in the same order
Causal ordering: all events which preceded the message
transmission at a process precede the message reception at other
processes. Events are message receptions and transmissions.
Total ordering: messages are processed at each group member in
the same order.
Sync ordering: for a sync-ordered message, either an event
occurred before message reception at all processes, or after it.
Other events may be causally or totally ordered.


2.13.6.3 FIFO ordering




Achieved by process adding a sequence number to each message.


Group member orders incoming messages with respect to sequence number.


Applicable when each process state is separate, or operations don't
modify state, just add incremental updates or read.
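The sequence-number scheme can be sketched as a receiver with a per-sender holdback queue. The class and names are invented for illustration.

```python
class FifoReceiver:
    def __init__(self):
        self.expected = {}   # sender -> next sequence number expected
        self.holdback = {}   # sender -> {seq: message} awaiting delivery
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.holdback.setdefault(sender, {})[seq] = msg
        nxt = self.expected.get(sender, 0)
        # Deliver every consecutive message now available, in order.
        while nxt in self.holdback[sender]:
            self.delivered.append(self.holdback[sender].pop(nxt))
            nxt += 1
        self.expected[sender] = nxt

r = FifoReceiver()
r.receive('p1', 1, 'b')   # arrives early: held back
r.receive('p1', 0, 'a')   # releases both, in sender order
print(r.delivered)        # ['a', 'b']
```

Note this orders messages per sender only; messages from different senders remain arbitrarily interleaved, which is exactly FIFO ordering's limit.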


2.13.6.4 Total Ordering




When several messages are sent to a group, all members of the group
receive the messages in the same order.


Two techniques for implementation:

Sequencer: elect a special sequencing node. All messages are
sent to the sequencer, who then sends messages on to the replicas. FIFO
ordering from the sequencer guarantees total ordering. Suffers from a
single point of failure (recoverable by election) and bottleneck.

Holdback queue: received messages are not passed to the
application immediately, but are held in a holdback queue
until the ordering constraints are met.


2.13.6.4.1 Sequence Number Negotiation

The sender negotiates the message's final sequence number with all
replicas.

  1. Replicas store largest final sequence number yet seen Fmax,
    and largest proposed sequence number Pmax
  2. Sender sends all replicas message with temporary ID.
  3. Each Replica i replies with suggested sequence number of
    max(Fmax, Pmax) + 1. Suggested sequence number
    provisionally assigned to message and message placed in holdback
    queue (ordered with smallest sequence number at front)
  4. Sending site chooses largest sequence number and notifies
    replicas of final sequence number. Replicas replace provisional
    sequence number with final sequence number.
  5. When item at front of queue has an agreed final sequence number,
    deliver the message.
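Steps 1-4 above can be sketched for a single message as follows; the `Replica` class and message plumbing are invented for illustration (the holdback queue and delivery step are omitted).

```python
class Replica:
    def __init__(self):
        self.fmax = 0   # largest final sequence number seen
        self.pmax = 0   # largest sequence number proposed

    def propose(self):
        # Step 3: suggest max(Fmax, Pmax) + 1 as a provisional number.
        self.pmax = max(self.fmax, self.pmax) + 1
        return self.pmax

    def finalise(self, seq):
        self.fmax = max(self.fmax, seq)

def total_order_send(replicas):
    proposals = [r.propose() for r in replicas]   # step 3
    final = max(proposals)                        # step 4: pick largest
    for r in replicas:
        r.finalise(final)                         # replace provisional
    return final

replicas = [Replica() for _ in range(3)]
print(total_order_send(replicas))   # 1
print(total_order_send(replicas))   # 2
```

Because every replica adopts the same final number, every replica's holdback queue would release messages in the same order, giving total ordering.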



2.13.6.5 Causal Ordering




``Cause'' here means potential cause: since we don't know the
application, any message might be causally related to an earlier one,
so that ordering must be preserved.


a and b are events, generally sending and receiving of
messages.


We define the causal relation,
a $ \rightarrow$ b, if

  1. if a and b are events at the same process and a happened
    before b, then a $ \rightarrow$ b
  2. if a is a message sent by process P1 and b is
    the arrival of the same message at P2, then
    a $ \rightarrow$ b is true



In a bulletin board, an article titled ``re: Multicast Routing'' in
response to an article called ``Multicast Routing'' should always come
after it, even though it may be received before the initial article


2.13.6.6 CBCAST - Causal ordering in ISIS


ISIS is a real commercial distributed system, based on process groups.


Causal ordering for multicast within a group is based around
Vector Timestamps


The vector VT has an entry for each member of the
group, typically an integer.


Vector timestamps have one operation defined



merge(u, v)[k] = max(u[k], v[k]), for k = 1..n


Incoming messages are placed on a holdback queue, until all messages
which causally precede the message have been delivered.



2.13.6.6.1 CBCAST Implementation





  1. All processes pi initialise the vector to zero
  2. When pi multicasts a new message, it first increments VTi[i] by 1; it piggybacks
    vt = VTi on the message



  3. Messages are delivered to the application in process Pj when

    • The message is the next in sequence from pi i.e.
      vt[i] = VTj[i] + 1
    • All causally prior messages that have been delivered to pi must have been delivered to pj, i.e.
      VTj[k] $ \geq$ vt[k] for k $ \neq$ i.


  4. When a message bearing a timestamp vt is delivered to pj, pj's timestamp is updated as
    VTj = merge(vt, VTj)



In words

  • Incoming vector timestamp is compared to current timestamp.
  • If conditions for delivery to process not met, then message placed on
    holdback queue.
  • When an incoming message is delivered, the timestamp is updated by the
    merge.
  • Examine all messages in the holdback queue to see if they can be
    delivered.
  • CBCAST requires reliable delivery.
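The delivery test and merge above can be sketched directly. Vectors are plain Python lists indexed by process number 0..n-1; the example scenario is invented for illustration.

```python
def merge(u, v):
    """Pointwise maximum of two vector timestamps."""
    return [max(a, b) for a, b in zip(u, v)]

def deliverable(vt, VT_j, i):
    """Message from sender i, timestamp vt, at a process with clock VT_j:
    deliverable iff it is next in sequence from i and everything the
    sender had seen has already been delivered here."""
    return vt[i] == VT_j[i] + 1 and all(
        vt[k] <= VT_j[k] for k in range(len(vt)) if k != i)

# A third process has seen nothing; p0's second message must wait for
# its first, even if it arrives earlier.
VT = [0, 0, 0]
m1, m2 = [1, 0, 0], [2, 0, 0]        # first and second messages from p0
assert not deliverable(m2, VT, 0)    # held back on the holdback queue
assert deliverable(m1, VT, 0)
VT = merge(m1, VT)                   # deliver m1, update the timestamp
assert deliverable(m2, VT, 0)        # now m2 can be delivered
```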


2.13.6.6.2 Causal Example


\includegraphics{figs/causal}


2.13.6.6.3 Group View Changes



  • When group membership changes, what set of messages should be
    delivered to members of changed group?
  • What happens to undelivered messages of failed members?
  • What messages should new member get?
  • ISIS solves by sending a sync ordered message announcing
    that the group view has changed. Messages thus belong to a
    particular group view.
  • Use coordinator to decide which messages belong to which view.


2.13.7 Summary



  • Replication of services and state increase availability
  • Replication increases performance
  • Replication increases Fault tolerance
  • To maintain consistency, multicast updates to all replicas
  • Use sequence numbers to maintain FIFO ordering
  • Use Vector Timestamps to maintain Causal Ordering
  • Use elected sequencers or identifier negotiation to maintain
    total ordering



2.14 Shared Data and Transactions





  • Stateful Servers
  • Atomicity
  • Transactions
  • ACID
  • Serial Equivalence


2.14.1 Servers and their state





  • Servers manage a resource, such as a database or a printer
  • Attempt to limit problems of distributed access by making server
    stateless, such that each request is independent of other
    requests.

    • Servers can crash in between servicing clients
    • Client requests cannot interfere with each other (assuming
      concurrency control in the server)


  • But we can't always design stateless servers...


2.14.1.1 Stateful Servers



  • Some applications better modelled as extended
    conversations, eg retrieving a list of records in a
    large database better modelled as getting batch of records at a
    time.
  • If application requires state to be consistent across a number
    of machines, then each machine must recognise when it can update
    internal data. Needs to keep track of state of distributed
    conversation
  • If conversations are long-lived, then the server needs to be aware of state.
  • If other conversations need to go on - eg modify records during
    retrieval, how do we allow concurrency?
  • What happens if machine fails - need to recover.
  • Should also aim to be fault tolerant


2.14.2 Atomicity


Stateful servers have two requirements

  1. Accesses from different clients shouldn't interfere
    with each other
  2. Clients should get fast access to the server


2.14.2.1 Definition


We define atomicity as
All or nothing: a client's operation on a server's resource
should complete successfully, and the results hold thereafter (yea,
even unto a server crash), or it should fail and the resource should
show no effect of the failed operation
Isolation: each operation should proceed without interference
from other clients' operations - intermediate effects should not be
visible.


2.14.2.2 Example


Mutual exclusion: for a multi-threaded server, if two or more
threads attempt to modify the same piece of data, then the updates
should be surrounded by mutual exclusion to provide isolation, using
semaphores or monitors
Synchronisation: in situations such as producer-consumer, we need
to allow one operation to finish so a second operation can use its
results, which also needs isolation.


2.14.3 Automatic Teller Machines and Bank accounts




\includegraphics{figs/cashmachine}



  • An ATM or cashmachine allows transfer of funds between accounts.
  • Accounts are held at various machines belonging to different
    banks
  • Accounts offer the following operations:
    deposit: place an amount of money in an account

    withdraw: take an amount of money from an account

    balance: get the current value in an account


  • Operations implemented as read() and write() of values, so
    withdraw x from A and deposit x in B implemented as

    1. A.write( A.read()  - x)
    2. B.write( B.read() + x)




2.14.4 Transactions


Transactions are technique for grouping operations on data so that
either all complete or none complete


Typically server offers transaction service, such as:

beginTransaction(transId): record the start of a transaction and
associate subsequent operations carrying this transId with it.
commitTransaction(transId): commit all the changes the
operations in this transaction have made to permanent storage.
abortTransaction(transId): abort all the changes the transaction's
operations have made, and roll back to the previous state.



2.14.4.1 ACID




Transactions are described by the ACID mnemonic

Atomicity: either all or none of the transaction's operations
are performed. If a transaction is interrupted by failure, then
partial changes are undone
Consistency: the system moves from one self-consistent state to another
Isolation: an incomplete transaction never reveals partial state
or changes before committing
Durability: after committing, the system never loses the results
of the transaction, independent of any subsequent failure

2.14.4.2 Concurrency Problems

2.14.4.2.1 Lost Update



Transaction T: A.withdraw(4,T); B.deposit(4,T)
Transaction U: C.withdraw(3,U); B.deposit(3,U)

Transaction T                      Transaction U
balance = A.read()          £100
A.write(balance - 4)         £96
                                   balance = C.read()          £300
                                   C.write(balance - 3)        £297
balance = B.read()          £200
                                   balance = B.read()          £200
                                   B.write(balance + 3)        £203
B.write(balance + 4)        £204

T read B before U wrote it, so T's final write overwrites U's: B ends
at £204 rather than the £207 a serial execution would give - U's
deposit of £3 is lost.




2.14.4.2.2 Inconsistent Retrievals



Transaction T: A.withdraw(100,T); B.deposit(100,T)
Transaction U: Bank.total(U)

Transaction T                      Transaction U
balance = A.read()          £200
A.write(balance - 100)      £100
                                   total = A.read()            £100
                                   total = total + B.read()    £300
                                   total = total + C.read()    £600
balance = B.read()          £200
B.write(balance + 100)      £300

U's total of £600 misses the £100 in flight from A to B: with A = £200,
B = £200 and C = £300, a serial execution would report £700.




2.14.5 Serial Equivalence


Definition: Two transactions are serial if all the operations in
one transaction precede the operations in the other.


eg the following actions are serial



Ri(x) Wi(x) Ri(y) Rj(x) Wj(y)


Definition: Two operations are in conflict if:

  • At least one is a write
  • They both act on the same data
  • They are issued by different transactions




Ri(x) Rj(x) Wi(x) Wj(y) Ri(y) has the pair
Rj(x), Wi(x) in conflict


Definition: Two schedules are computationally equivalent if:

  • The same operations are involved (possibly reordered)
  • For every pair of operations in conflict, the same operation
    appears first in each schedule

So, a schedule is serialisable if the schedule is computationally
equivalent to a serial schedule.
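For two transactions these definitions can be checked mechanically: collect the conflicting pairs, then test that the same transaction comes first in every conflict. This sketch represents operations as invented (transaction, kind, item) tuples; note the pairwise test below is only sufficient for two transactions (with more, a cycle check on the conflict graph is needed).

```python
def conflicts(schedule):
    """Yield (earlier_trans, later_trans) for each conflicting pair:
    different transactions, same item, at least one write."""
    for a in range(len(schedule)):
        for b in range(a + 1, len(schedule)):
            t1, k1, x1 = schedule[a]
            t2, k2, x2 = schedule[b]
            if t1 != t2 and x1 == x2 and 'W' in (k1, k2):
                yield t1, t2

def serialisable(schedule):
    order = set(conflicts(schedule))
    # Serialisable iff no pair of transactions conflicts both ways.
    return not any((t2, t1) in order for t1, t2 in order)

# The earlier example: Ri(x) Rj(x) Wi(x) Wj(y) Ri(y).
s = [('i', 'R', 'x'), ('j', 'R', 'x'), ('i', 'W', 'x'),
     ('j', 'W', 'y'), ('i', 'R', 'y')]
print(serialisable(s))   # True: every conflict puts j before i
```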


Question: Is the following schedule for these two transaction serially
equivalent?



Ri(x) Ri(y) Rj(y) Wj(y) Ri(x) Wj(x) Wi(y)



2.14.5.1 Transaction Nesting




Transactions may themselves be composed of multiple transactions


eg Transfer is a composition of withdraw and
deposit transactions, which are themselves composed of read and
write transactions


Benefits:

  • Nested transactions can run concurrently with other transactions
    at same level in hierarchy
  • If lower levels abort, may not need to abort whole transaction.
    Can instead use other means of recovery.


2.14.6 Summary





  • Transactions provide technique for managing stateful servers
  • Need to worry about concurrency control
  • Need to worry about aspects of distribution
  • Need to worry about recovery from failure


2.15 Concurrency Control and Transactions





  • Problem restatement
  • Locking
  • Optimistic control
  • Timestamping


2.15.1 Why concurrency control?





  • To increase performance, multiple transactions must be able to
    carry on work simultaneously...
  • ...but if data is shared, then can lead to problems such as lost
    updates and inconsistent retrievals.
  • So we must ensure schedules of access to data for concurrent
    transactions are computationally equivalent to a serial schedule of
    the transactions.


2.15.2 Locking





  • As in operating systems, locks control access for different
    clients
  • Granularity of data locked should be small so as to maximise
    concurrency, with trade-off against complexity.
  • To prevent intermediate leakage, once lock is obtained, it must
    be held till transaction commits or aborts


2.15.2.1 Conflict rules





  • Conflict rules determine rules of lock usage
  • If operations are not in conflict, then locks can be shared
    $ \Rightarrow$ read locks are shared
  • Operations in conflict imply operations should wait on lock
    $ \Rightarrow$ write waits on read or write lock, read waits on write
    lock
  • Since can't predict other item usage till end of transactions,
    locks must be held till transaction commits or aborts.
  • If a transaction needs to do another operation on the same data, it
    promotes its lock if necessary and possible - the new operation
    may conflict with an existing shared lock


2.15.2.2 Rules for strict two phase locking





  1. When operation accesses data item within transaction

    1. If item isn't locked, then server locks and proceeds
    2. If item is held in a conflicting lock by another transaction,
      transaction must wait till lock released
    3. If item is held by non-conflicting lock, lock is shared and
      operation proceeds
    4. If item is already locked by same transaction, lock is
      promoted if possible (refer to rule b)


  2. When transaction commits or aborts, locks are released
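The rules above can be sketched as a toy lock manager. All names are invented, and waiting is not modelled: a conflicting request simply returns 'wait' instead of blocking the caller.

```python
class LockManager:
    def __init__(self):
        self.locks = {}   # item -> (kind, set of transIds holding it)

    def lock(self, trans, item, kind):   # kind is 'read' or 'write'
        if item not in self.locks:
            self.locks[item] = (kind, {trans})           # rule 1.1
            return 'granted'
        held_kind, holders = self.locks[item]
        if holders == {trans}:
            if kind == 'write':
                self.locks[item] = ('write', holders)    # rule 1.4: promote
            return 'granted'
        if held_kind == 'read' and kind == 'read':
            holders.add(trans)                           # rule 1.3: share
            return 'granted'
        return 'wait'                                    # rule 1.2: conflict

    def release(self, trans):
        # Rule 2: all locks released only at commit or abort.
        for item in list(self.locks):
            kind, holders = self.locks[item]
            holders.discard(trans)
            if not holders:
                del self.locks[item]
```

Holding every lock until commit or abort is what makes the scheme strict: no other transaction can see intermediate values.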


2.15.2.3 Locking Implementation





  • Locks generally implemented by a lock manager
    lock(transId, DataItem, LockType): lock the specified item if
    possible, else wait according to the rules above

    unLock(transId): release all locks held by the transaction


  • Lock manager generally multi-threaded, requiring internal
    synchronisation
  • Heavyweight implementation


2.15.2.4 Example


Transactions T and U.

  • T:
    RT(i), WT(j, 44)
  • U:
    WU(i, 55), RU(j), WU(j, 66)

Question What are the possible schedules allowed under strict
locking?


Question Are there any schedules computationally equivalent
to a serial schedule which are disallowed?



2.15.2.5 Deadlocks





  • Locks imply the possibility of deadlock, under the following conditions:

    1. Limited access (eg mutex or finite buffer)
    2. No preemption (if someone has a resource, it can't be taken away)
    3. Hold and wait: threads hold some of the resources they need
      while waiting for the remainder to become free.
    4. Circular chain of requests and ownership.

  • Most common way of protecting against deadlock is through
    timeouts. After timeout, lock becomes vulnerable and can be
    broken if another transaction attempts to gain lock, leading to
    aborted transactions


2.15.2.6 Drawbacks of Locking





  • Locking is overly restrictive on the degree of concurrency
  • Deadlocks produce unnecessary aborts
  • Lock maintenance is an overhead, that may not be required


2.15.3 Optimistic Concurrency Control



  • Most transactions do not conflict with each other
  • So proceed without locks, and check on close of transaction that
    there were no conflicts

    • Analyse conflicts in validation process
    • If conflicts could result in non-serialisable schedule, abort
      one or more transactions
    • else commit




2.15.3.1 Implementation of Optimistic Concurrency Control


Transaction has following phases

  1. Read phase in which clients read values and acquire tentative
    versions of items they wish to update
  2. Validation phase in which operations are checked to see if they
    are in conflict with other transactions - complex part. If invalid,
    then abort.
  3. Write phase: if validated, tentative versions are made permanent
    and the transaction can commit; otherwise it aborts.


2.15.3.2 Validation approaches





  • Validation based upon conflict rules for serialisability
  • Validation can be either against completed transactions or
    active transactions - backward and forward validation.
  • Simplify by ensuring only one transaction in validation and
    write phase at one time
  • Trade-off between the number of comparisons made and the number of
    transactions whose read/write sets must be stored.


2.15.3.2.1 Forward Validation



  1. A transaction in validation is compared against all
    transactions that haven't yet committed
  2. Writes may affect ongoing reads
  3. The write set of the validating transaction is compared
    against the read sets of all other active transactions.
  4. If the sets conflict, then either abort validating
    transaction, delay validation till conflicting transaction
    completes, or abort conflicting transaction.


2.15.3.2.2 Backward validation



  1. Writes of the current transaction can't affect previous
    transactions' reads, so we need only compare the current
    transaction's reads against the writes of overlapping transactions
    that have already committed.
  2. If current read sets conflict with already validated
    overlapping transactions write sets, then abort validating
    transaction

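Both validation directions reduce to set-intersection checks. A sketch,
assuming read and write sets are represented as Python sets (an
illustrative encoding, not from the notes):

```python
def forward_validate(write_set, active_read_sets):
    """Forward validation: compare the validating transaction's write set
    against the read sets of all still-active transactions.
    Returns the ids of conflicting transactions."""
    return [tid for tid, reads in active_read_sets.items()
            if write_set & reads]

def backward_validate(read_set, committed_write_sets):
    """Backward validation: compare the validating transaction's read set
    against the write sets of overlapping committed transactions."""
    return [tid for tid, writes in committed_write_sets.items()
            if read_set & writes]
```

With forward validation a non-empty result leaves a choice of victim;
with backward validation the only option is to abort the validating
transaction itself.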
2.15.4 Timestamping


Operates on tentative versions of data

  • Each transaction receives a globally unique timestamp on initiation
  • Every object x in the system or database carries the maximum (ie
    youngest) timestamp of the last transaction to read it, RTM(x), and
    of the last transaction to write it, WTM(x)
  • If transaction requests operation that conflicts with younger
    transaction, older transaction restarted with new timestamp.
  • Transactions committed in order of timestamps, so a transaction
    may have to wait for earlier transaction to commit or abort before
    committing.
  • Since tentative version is only written when transaction is
    committed, read operations may have to wait until the last
    transaction to write has committed.



An operation in transaction Ti with start time TSi
is valid if:

  • The operation is a read operation and the object was last
    written by an older transaction ie
    TSi > WTM(x). If read permissible,
    RTM(x) = MAX(TSi, RTM(x))
  • The operation is a write operation and the object was
    last read and written by older transactions ie

    TSi > RTM(x) and
    TSi > WTM(x). If permissible,
    WTM(x) = TSi

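The two validity rules above can be sketched as follows, keeping RTM
and WTM in dictionaries keyed by object (names and encoding are
illustrative):

```python
class TimestampOrdering:
    """Sketch of the basic timestamp-ordering checks. Each method returns
    True if the operation is valid, False if the requesting transaction
    must be restarted with a new timestamp."""
    def __init__(self):
        self.rtm = {}   # RTM(x): youngest timestamp to have read x
        self.wtm = {}   # WTM(x): youngest timestamp to have written x

    def read(self, ts, x):
        if ts <= self.wtm.get(x, float("-inf")):
            return False          # x written by a younger transaction
        self.rtm[x] = max(ts, self.rtm.get(x, float("-inf")))
        return True

    def write(self, ts, x):
        if (ts <= self.rtm.get(x, float("-inf"))
                or ts <= self.wtm.get(x, float("-inf"))):
            return False          # x read or written by a younger transaction
        self.wtm[x] = ts
        return True
```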

2.15.5 Summary



  • Locks are commonest ways of providing consistent concurrency
  • Optimistic concurrency control and timestamping used in some
    systems
  • But, consistency in concurrency is application dependent - for
    shared editors, people may prefer to trade speed of execution
    against the possibility of having to resolve conflicts. Problems can
    occur with long-term network partition. Approaches based on
    notification and human resolution are becoming popular.



2.16 Distributed Transactions





  • Models for distributed transactions
  • Attaining distributed commitment
  • Distributed Concurrency Control


2.16.1 Single Server Transactions




\includegraphics{figs/single-server-trans}



  • Till now, transactions have referred to multiple clients, single
    server.
  • How do we have multiple clients interacting with multiple
    servers? eg complicated funds transfer involving different accounts
    from different banks?
  • Generalise transactions to distributed case...

2.16.2 Distributed Transactions


2.16.2.1 Distributed Transaction Requirements


General characteristics of distributed systems

  • Independent Failure Modes
  • No global time
  • Inconsistent State



Need to consider:

  • how to achieve distributed commitment (or abort)
  • how to achieve distributed concurrency control


2.16.2.2 Models


\includegraphics{figs/transaction-models}



  • If the client runs a sequence of transactions, each transaction must
    complete before proceeding to the next
  • If transactions are nested, then transactions at same level can
    run in parallel
  • Client uses a single server to act as coordinator for all
    other transactions. The coordinator handles all communication with
    other servers



Question: What are the requirements of transaction ids?


2.16.3 Atomic Commit Protocols





  • Distribution implies independent failure modes, ie a machine can
    fail at any time, and the others may not discover the failure.
  • With one phase commit, the client requests commit, but one of
    the servers may have failed - no way of ensuring durability
  • Instead, commit in 2 phases, thus allowing server to request abort.


2.16.3.1 2 Phase Commit



  • One coordinator responsible for initiating protocol.



  • Other entities in protocol called participants.
  • If coordinator or participant unable to commit, all parts of
    transaction are aborted.



  • Two phases

    Phase 1: Reach a common decision

    Phase 2: Implement that decision at all sites


2.16.3.1.1 2 Phase Commit Details





  1. Phase 1 The coordinator sends a Can Commit? message to all
    participants in transaction.
  2. Participants reply with a vote of yes or no. If the vote
    is no, the participant aborts immediately.
  3. Phase 2 Coordinator collects votes including own:

    1. If all votes are yes, coordinator commits transaction
      and sends DoCommit to all participants.
    2. Otherwise transaction is aborted, and coordinator sends abortTransaction to all participants.


  4. When a participant receives DoCommit, it commits its
    part of the transaction and confirms using HaveCommitted

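The decision rule of the protocol reduces to a conjunction of the
votes. A minimal sketch of phase 2's decision only; messaging, timeouts
and crash recovery are deliberately omitted:

```python
def two_phase_commit(coordinator_vote, participant_votes):
    """Coordinator decision logic: collect all votes, including the
    coordinator's own, and commit iff every vote is yes."""
    votes = [coordinator_vote] + list(participant_votes)
    if all(v == "yes" for v in votes):
        return "DoCommit"
    return "abortTransaction"
```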
2.16.3.1.2 2 Phase Commit Diagram




\includegraphics{figs/2phasecommit}


Note:

  • If participant crashes after having voted to commit, it can ask
    coordinator about results of vote.
  • Timeouts are used when messages are expected.
  • Introduces new state in transaction Prepared to commit.

2.16.4 Distributed Concurrency Control


2.16.4.1 Locking



  • Locking is done per item, not per client.
  • No problems generalising to multiple servers...
  • ...except in dealing with distributed deadlock
  • Same techniques as usual, but interesting dealing with
    distributed deadlock detection.


2.16.4.2 Optimistic Concurrency Control



  • Need to worry about distributed validation
  • Simple model of validation had only one transaction being
    validated at a time - this can lead to deadlock if different
    coordinating servers attempt to validate different transactions.
  • Also need to validate in correct serialisable order.
  • One solution is to globally allow only one transaction to
    validate at a time.
  • Another solution is to validate in two phases with timestamp
    allocation - local, then global to enforce ordering.


2.16.4.3 Timestamping



  • If clocks are approximately synchronised, then timestamps can be

    <local timestamp, coordinating server id>

    pairs, with an ordering defined upon server ids.

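Such pairs order naturally under lexicographic comparison: compare the
local timestamps first, and break ties with the server-id ordering. A
sketch, assuming server ids are comparable strings:

```python
def ts_less(a, b):
    """Compare two <local timestamp, server id> pairs. Python's tuple
    comparison is lexicographic, which is exactly the ordering needed:
    timestamp first, server id as tie-breaker."""
    return tuple(a) < tuple(b)
```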

2.16.5 Summary





  • Nested transactions are the best model for distributed transactions
  • The Two Phase Commit protocol is suitable for almost all cases
  • Distributed concurrency control is only slightly more difficult
    than the single server case

2.17 Transactions: Coping with Failure



  • Failure Modes
  • Recovery Techniques
  • Partitions and quorum voting


2.17.1 Failure Modes


For Transactions to be atomic and durable, need to examine
failures

  1. Transaction-local failures, detected by the application which
    calls abort eg insufficient funds. No info loss, need
    to undo changes made.
  2. Transaction-local failures, not detected by the application but by
    the system as a whole, eg divide by zero. System calls abort.
  3. System failures affecting transactions in progress but not media
    eg CPU failure. Loss of volatile store and possibly all
    transactions in progress. On recovery, special recovery manager
    undoes effects of all transactions in progress at failure.
  4. Media failures affecting database eg head crash. No way of
    protecting against this.





2.17.2 Recovery



  • We assume a machine crashes, but then is fixed and returns to operation.
  • We need to recover state to ensure that the guarantees of the
    transactional systems are kept.
  • Use a recovery file or log that is kept on permanent storage.


2.17.2.1 The Recovery Manager





  • Recovery from failure handled by entity called Recovery Manager.
  • Keeps information about changes to the resource in a recovery file (also called Log)
    kept in stable storage - ie something that will survive failure.
  • When coming back up after failure, the recovery manager looks
    through the recovery file and undoes (or redoes) changes so that
    uncommitted transactions appear never to have happened, and
    committed transactions' effects persist.
  • Events recorded on Recovery file for each change to
    an object in database.


2.17.2.2 Recovery File


Information recorded per event includes:

  Transaction Id: to associate the change with a transaction
  Record Id: the identifier of the object
  Action type: create/delete/update etc
  Old Value: to enable changes to be undone
  New Value: to enable changes to be redone

Also log beginTransaction, prepareToCommit, commit, and abort actions,
with their associated transaction id.
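One possible encoding of a single recovery-file event carrying the
fields listed above (field names are illustrative, not from the notes):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class LogRecord:
    """One recovery-file event. Control actions such as beginTransaction,
    prepareToCommit, commit and abort would set action accordingly and
    leave the value fields unused."""
    trans_id: str     # associates the change with a transaction
    record_id: str    # identifier of the object changed
    action: str       # "create" / "delete" / "update" / control actions
    old_value: Any    # for undo
    new_value: Any    # for redo
```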


2.17.2.3 Recovering


If after failure,

  • the database is undamaged, undo all changes made by
    transactions executing at time of failure
  • the database is damaged, then restore database from archive and
    redo all changes from committed transactions since archive date.

The recovery file entry is made and committed to stable storage before
the change itself is made (write-ahead logging) - so incomplete
transactions can be undone, and committed transactions redone.


What might happen if database changed before recovery file written?


Note that recovery files have information needed to undo transactions.



2.17.2.4 Checkpointing




Calculation of which transaction to undo and redo on large logs can be
slow.


Recovery files can get too large


Instead, augment recovery file with checkpoint

  • Force recovery file to stable storage
  • Write checkpoint record to stable store with

    1. A list of currently active transactions
    2. for each transaction a pointer to the first record in
      recovery file for that transaction


  • Force database to disk
  • Write address of checkpoint record to restart location atomically


2.17.2.5 Recovering with checkpoints




To recover, have undo and redo lists. Add all active transactions at
last checkpoint to undo list

  1. Forwards from checkpoint to end,

    1. If find beginTransaction add to undo list
    2. If find commit record add to redo list
    3. If find abort record remove from undo list


  2. Backwards from end to the first record of the checkpointed
    transactions, execute undo for all transaction operations on the
    undo list
  3. Forwards from checkpoint to end, redo operations for transactions
    on the redo list



At each checkpoint, the recovery file can be truncated up to the first
logged record of the checkpointed transactions


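The first, forward pass above might be sketched as follows. The log
encoding is assumed for illustration, and removing a committed
transaction from the undo list is a reasonable reading that the notes
leave implicit:

```python
def recovery_lists(checkpoint_active, log_after_checkpoint):
    """Pass 1 of checkpoint recovery: compute the undo and redo lists.
    checkpoint_active is the list of transactions active at the last
    checkpoint; log entries are (action, trans_id) pairs."""
    undo = set(checkpoint_active)      # assume active transactions need undoing
    redo = set()
    for action, tid in log_after_checkpoint:
        if action == "beginTransaction":
            undo.add(tid)
        elif action == "commit":
            redo.add(tid)
            undo.discard(tid)          # committed work is redone, not undone
        elif action == "abort":
            undo.discard(tid)
    return undo, redo
```

The actual undo and redo passes (steps 2 and 3) would then walk the log
backwards and forwards applying the old and new values respectively.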

2.17.2.6 Recovery of the Two Phase Commit Protocol




\includegraphics{figs/2phasecommit}

  • Coordinator uses prepared to signal starting protocol,
    commit on signalling DoCommit and done to indicate end
    of protocol in recovery file.
  • Participant uses uncertain to indicate that it has
    replied yes to the commit request, and committed when it
    receives DoCommit.
  • On recovery, coordinator aborts transactions which reached only
    prepared, and resends DoCommit when in commit state
    but not done
  • Participant requests decision from coordinator if in
    uncertain state, but not committed.


2.17.3 Network Partition


Transactions are often used to keep replicas consistent.


If network partitions (cable breaks), replicas divided into two or more sets
(possibly with common members).


\includegraphics{figs/partition}


Can we still write and read from any of the sets?


Yes, but

  • Must reduce possible read and write sets to maintain consistency
  • Or relax consistency requirements and resolve problems when
    partition is healed


2.17.3.1 Quorum Consensus


Consider set of replicas, where replicated objects have version
numbers at each replica.

  1. Assign a weighting of votes to each replica, indicating importance.
  2. For a client to perform an operation, it must gather votes from all
    the replicas it can talk to (denote the total X).
  3. X >= R, the votes set for the read quorum, to enable a read
  4. X >= W, the votes set for the write quorum, to enable a write.
  5. As long as

    • W > half the total number of votes
    • R + W > total number of votes in the group

    each read quorum and each write quorum will have at least one member
    in common.


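The two constraints can be checked directly. A sketch, where votes is
the per-replica weighting (function name is illustrative):

```python
def valid_quorums(votes, R, W):
    """Check the quorum consensus constraints: W must exceed half the
    total votes, and R + W must exceed the total, which forces every
    read quorum to overlap every write quorum."""
    total = sum(votes)
    return W > total / 2 and R + W > total
```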
2.17.3.2 Partition Example


For three replicas, R1, R2, R3, we can allocate votes to give
different properties depending upon requirements


  Replica    config 1    config 2    config 3
  R1             1           2           1
  R2             0           1           1
  R3             0           1           1


  1. What should R and W be set to in the three
    configurations?
  2. What properties result from these configurations?


2.17.3.3 Read and Write Operations




When write happens, object has a version number incremented


To read, collect votes from replica managers along with the version
numbers of the object. A read quorum is guaranteed to contain at least
one up-to-date copy, from which the read occurs.


To write, collect votes from replica managers along with the version
numbers of the object. If a write quorum of up-to-date copies is not
found, propagate the up-to-date copy to enough replicas to create one.
Then the write is allowed.

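The read side might be sketched as follows. Assuming the quorum
constraints above hold, the R gathered copies are guaranteed to include
at least one with the highest version number (the encoding and names
are illustrative):

```python
def quorum_read(replicas, R):
    """Read under quorum consensus: gather (version, value) pairs from
    any R reachable replicas and read from the copy with the highest
    version number. replicas maps replica id -> (version, value)."""
    reachable = list(replicas.items())
    if len(reachable) < R:
        raise RuntimeError("cannot assemble a read quorum")
    quorum = dict(reachable[:R])               # any R replicas will do
    best = max(quorum, key=lambda r: quorum[r][0])
    return quorum[best]
```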

Manipulating R and W gives different characteristics, eg R = 1 and W = number of copies gives unanimous update.


Cached copies of objects can be incorporated as weak
representatives
with 0 votes, but usable for reads.



2.17.4 Summary



  • Atomicity comes from using logging techniques on operations at
    server, where log is kept on stable storage
  • Voting can be used to give availability for resources on
    partitioned replicas.


3. Exercises and answers


