Wednesday, 1 April 2009

dist-sys-susx.ac.uk
source: http://www.cogs.susx.ac.uk/courses/dist-sys/node1.html



2.1 Introduction


Main Points

  • What we are going to cover
  • What distributed systems are
  • What characteristics well-designed systems have


2.1.1 Course Outline




What are we going to do?

  • 1 lecture on computer communications and networks and loosely
    coupled systems, revising material from the Multimedia
    Communications Technology course. The Postgraduate course will get
    an additional two hours in their seminar slot.
  • 5 lectures on Remote Procedure Call, distributed objects and
    classic distributed systems such as NFS
  • 3 lectures on integrating distributed systems through the web,
    content distribution networks and Peer to Peer computing.
  • 5 lectures on closely coupled systems for distributed
    transactions
  • 1 lecture on recent advances in networks and distributed
    systems

2.1.2 What's a Distributed System




A distributed system: physically separate computers working
together



  • Cheaper and easier to build lots of simple computers
  • Easier to add power incrementally
  • Machines may necessarily be remote, but the system should still work together
  • Higher availability - one computer crashes, others carry on
    working
  • Better reliability - store data in multiple locations
  • More security - each piece easier to secure to right level.


2.1.2.1 The real world...


In real life, can get:

  • Worse availability - every machine must be up. ``A distributed
    system is one where some machine you've never heard of fails and
    prevents you from working'' (Leslie Lamport)
  • Worse reliability
  • Worse security

Problem: Coordination is more difficult because multiple
people involved, and communication is over network.


Your Task: What are the distributed interactions when you
login at an X terminal?


2.1.2.2 Interactions at an X terminal




Simplified interactions




Image login


2.1.3 Example Distributed Systems




Electronic Mail: Mail delivered to a remote mailbox. Requires a
global name space to identify users, and transport mechanisms to get
mail to the mailbox.
Distributed Information - WWW: Remote information hidden below a
hypertext browser. Caching and other features operate transparently.
Distributed File System: Files stored on many machines,
generally not the machine you're working on. Files are accessed
transparently, the OS knowing they're remote and doing remote
operations on them such as read and write, eg the Network File System
(NFS).
Trading Floor System: Bids made, stocks sold, screens updated.

2.1.3.1 Network assumptions



  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

(Source: The Eight Fallacies of Distributed Computing -
Peter Deutsch)


2.1.4 What do we want from a Distributed System?





  1. Resource Sharing
  2. Openness
  3. Concurrency
  4. Scalability
  5. Fault Tolerance
  6. Transparency

Your Task: order the importance of each of these features for the
example systems in the previous slide.


2.1.5 Elements of a Distributed System


Another way to view a system...

  • Communications system
  • Messages
  • Machines
  • Processes on Machines
  • Programs
  • People


2.1.6 Conclusion



  • It's difficult to design a good distributed system: there are a
    lot of problems in getting ``good'' characteristics, not the least of
    which is people
  • Over the next ten weeks you will gain some insight into how to
    design a good distributed system.


2.2 Bits and Bytes


Main Points

  • Bits and Bytes
  • Kilo, Mega, Giga, Tera
  • Memory, Packets and Bit Manipulation




2.2.1 Bits, Bytes, Integers etc





  • All data in computers is held as a sequence of ones and zeros.
  • A number is represented as the base 2 number held in some
    pre-ordained fixed number of bits
  • Almost all other data is represented as various sets of
    numbers.
  • A byte is 8 bits sequenced together - what is the
    maximum number?
  • Integers (in Java and most other languages nowadays)
    are 32 bits long - what is the maximum number?
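Both questions above can be checked with a couple of lines (a sketch, assuming unsigned values; note that Java's int is signed, so its maximum is 2^31 - 1):

```python
# Maximum value representable in n bits is 2**n - 1.
def max_unsigned(bits):
    return 2 ** bits - 1

print(max_unsigned(8))    # a byte: 255
print(max_unsigned(32))   # 32-bit unsigned: 4294967295
# Java's int is signed, so one bit holds the sign:
print(2 ** 31 - 1)        # 2147483647
```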


2.2.2 Prefixes



  • In communications we often talk about throughput in
    bits/second, or moving files of some particular size. We use
    magnitude prefixes for convenience:

    kilo: 10^3 x

    mega: 10^6 x

    giga: 10^9 x

    tera: 10^12 x

  • There is often confusion as to whether a kilobyte is 1000 or
    1024 bytes. When dealing with processor architectures, it's
    generally 1024. When dealing with communications, it's
    generally 1000. State assumptions if it is not obvious from
    context.



2.2.3 Memory and Packets



  • A computer stores data as bits in memory.
  • When it wants to send this data to another computer, it copies
    the bits into the memory of the communications device.
  • The communications device sends the bits through the network
    to the other machine (we'll cover the details of this in the
    coming week).
  • The other machine's communication device places the bits
    into memory which the other machine can access.
  • The tricky bits come in ensuring that both machines
    interpret the bits correctly.

2.2.4 Bit Manipulation



  • Not only must the data be sent, but also accompanying information
    allowing the computers to interpret the context of the data.
  • Communications software must be able to pick out arbitrary bits
    from an opaque bit sequence and interpret their meaning.
  • We do this using bitwise operations - and, or, exclusive or,
    negation.
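A sketch of picking fields out of an opaque bit sequence with shifts and masks; the 16-bit header layout here is invented for illustration:

```python
# Hypothetical 16-bit header: a 4-bit version field followed by a
# 12-bit length field. Fields are extracted with shifts and AND masks.
header = 0x4ABC  # version 4, length 0xABC

version = (header >> 12) & 0xF   # shift the high bits down, mask 4 bits
length = header & 0xFFF          # mask off the low 12 bits

print(version)  # 4
print(length)   # 2748 (0xABC)
```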


2.3 Foundations of Distributed Systems


Aim: How do we send messages between computers across a network?
Physical Concepts: Bandwidth, Latency
Packet Concepts: Data, Headers
Routing Concepts: Shared Media, Switches and Routing Tables,
Routing Protocols

For more detailed notes, see Dan Chalmers' Multimedia Communications Technology notes in http://www.informatics.sussex.ac.uk/courses/mct/.

2.3.1 Physical Concepts


What is a message?

  • A piece of information which needs to move from one process to
    another, eg a request to open a file, an email message, the
    results of a remote print request.


For the network, this is just a sequence of bits in memory.


Need to communicate this sequence of bits across communications
system to other host.


Signal to other side whether each bit in sequence is
0 or 1.


To communicate, each end needs access to a common substrate; the
sender then changes the value of a physical characteristic
of the substrate.


Examples - voltage level of a piece of wire, light level in optical
fibre, frequency of radio wave in air


2.3.1.1 Signal characteristics




If physical characteristic has two levels then signal is binary.


If it has more than two levels, a number of bits can be encoded in each
level.


If 4 levels, then level 0 = 00, level 1 = 01, level 2 = 10, level 3 =
11


If 8 levels, how many bits can be encoded?
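In general, L levels encode log2(L) bits per symbol, which answers the question above:

```python
import math

# levels = 2**bits, so bits = log2(levels).
def bits_per_level(levels):
    return int(math.log2(levels))

print(bits_per_level(4))  # 2 bits, as in the 4-level example
print(bits_per_level(8))  # 3 bits
```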


2.3.1.2 Bandwidth




Adjusting the level at one end, and recognising the level has changed
at the other takes finite time.


The finite time limits the speed at which bits can be signalled.


The rate of signalling is the bandwidth, measured in bits per second.


If it takes 20 microseconds to raise the voltage on a wire, and 10
milliseconds to recognise the new level, what is the possible
bandwidth if there are two levels? Eight levels?


2.3.1.3 Noise and errors


Can we get infinite bandwidth by increasing the number of levels?


No, since noise makes it impossible to correctly distinguish level.


Noise is random changes in the physical characteristic to which all
physical phenomena are prone.


Always some probability that level will be misinterpreted.


Always some probability of error in message.


Goal of communications engineers is to make this probability as low as
necessary.


2.3.1.4 Latency


Does signal propagate along wire infinitely fast?


No, limit to speed of propagation of light. Hence described
as propagation delay.


Latency is time taken for signal to travel from one end of
communication system to destination.


Since the communication system may reconstruct the message at intermediate
points, the time taken to reconstruct the message is also part of latency.
This is known as switching delay.


Your Task: Describe the bandwidth, propagation delay and
switching delay in a game of Chinese whispers.

2.3.2 Packets


If there is an error in a message, how is it detected and rectified?



  1. Compute a checksum over the message
  2. Send the checksum with the message
  3. Calculate a new checksum over the received message
  4. Compare the checksums - if different, then message is in error
  5. Ask for the message to be resent

Probability of error rises with the length of the message.


Thus messages are sent in separate lumps with a maximum number of bits
per lump, known as packets.


If the message fits in one packet, good. Otherwise message is in many
packets.
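The five steps above can be sketched as follows; the simple wrapping sum here is only in the spirit of real Internet checksums, not any particular protocol's algorithm:

```python
# Toy detect-and-retransmit scheme: a 16-bit sum travels with the
# message; the receiver recomputes it and asks for a resend on mismatch.
def checksum(data: bytes) -> int:
    total = 0
    for b in data:
        total += b
        total = (total & 0xFFFF) + (total >> 16)  # wrap carries back in
    return total

def send(data):
    return data, checksum(data)          # step 2: checksum accompanies message

def receive(data, received_sum):
    if checksum(data) != received_sum:   # steps 3-4: recompute and compare
        return "resend"                  # step 5: ask for retransmission
    return "ok"

msg, s = send(b"hello")
print(receive(msg, s))       # ok
print(receive(b"hellp", s))  # resend (message corrupted in transit)
```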

2.3.2.1 Addressing




How do we direct the packet to correct recipient(s)?



  • Put header bits on front of packet, analogous to address and
    other information on envelope in postal system. Add source of
    packet to allow returns
  • Destination | Source | Packet body
  • If destination and source share the same physical medium, then
    destination can listen for its own address (ethernet, wireless,
    single point to point link)
  • If they don't share the same LAN, we use routing


2.3.2.2 Names, addresses and routes




Name: Identifier of object, eg Ian Wakeman, Mum
Address: Location of object, eg Rm 4C6, COGS
Route: How to get there from here - ``turn left, first right and
up stairs, first door on left''



Internet: Name is a Domain Name, such as www.cogs.susx.ac.uk. More
later.


To get packet through network, turn name into address by asking Domain
Name Service (DNS).


Place address in packet header and hand to nearest router.


Router locates MAC address corresponding to IP address, if they're on
the same LAN.

2.3.2.3 Indirect Addressing






\includegraphics{figs/indirect-addresses}

2.3.2.4 Architecture of a switch




\includegraphics{figs/switch}




A switch is a specialised computer. Can turn a PC into a router,
using free software.

  1. A packet arrives on a link
  2. The switch gets the destination of the packet from the packet
    header
  3. The switch looks up the destination in a routing table in
    memory and discovers the output link
  4. The switch queues the packet to be sent on the output link
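Steps 2-4 can be sketched as a table lookup with a default route; the table entries below are invented for illustration:

```python
# Toy switch forwarding step: look the destination up in the routing
# table and pick the output link, falling back to a default link when
# the destination is unknown.
routing_table = {"susx.ac.uk": "link1", "mit.edu": "link2"}  # made-up entries
DEFAULT_LINK = "link0"

def forward(destination):
    return routing_table.get(destination, DEFAULT_LINK)

print(forward("mit.edu"))      # link2
print(forward("unknown.org"))  # link0, the default route
```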

2.3.2.5 Distributed Routing Table Maintenance


The situation:

  • People are network administrators (and end users)
  • Communication systems are links (possibly multipoint)
  • Machines are switches
  • Messages may be lost

The problem: Given an address, how does a router know which output
link to send the packet on?


Choices:

  1. Packet could contain list of all output links - source routing.
    Requires source to lookup and determine route. May be good thing.
  2. Router could look up address in local routing table, and send out of
    corresponding link. If not in table, send out default link.



How do we construct tables? Could install entries by hand, known as
static routing.

But

  • Limited to entries people know about.
  • Unable to ensure consistency and absence of loops
  • Unable to respond to changing topology, sharks gnawing through undersea
    cables etc.
  • Internet has no centralised authority



So we use a distributed routing algorithm



2.3.2.6 Distance Vector Routing



  1. Each switch knows its own address
  2. Each link has a ``cost'', such as a value of 1 per link, or a measure
    of delay.
  3. Each switch starts out with distance vector, consisting of 0 for
    itself, and infinity for everyone else
  4. Switches exchange distance vectors with neighbour switches, and
    whenever info changes
  5. Switch saves most recent vector from neighbours
  6. Switch calculates own distance vector by examining cost from
    neighbour and adding cost of link to neighbour
  7. Use link with minimum cost to destination as link to route out.

Examples include RIP and BGP.
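Steps 5-7 can be sketched as a single recomputation at one switch; the neighbours, costs and vectors below are invented for illustration:

```python
# One distance-vector update at switch A: combine each neighbour's
# advertised vector with the cost of the link to that neighbour, and
# keep the minimum cost (and corresponding link) per destination.
INF = float("inf")

link_cost = {"B": 1, "C": 4}                     # cost of link to each neighbour
neighbour_vectors = {                            # most recent vectors received
    "B": {"B": 0, "C": 2, "D": 5},
    "C": {"B": 2, "C": 0, "D": 1},
}

def recompute(destinations):
    vector, routes = {}, {}
    for dest in destinations:
        best_cost, best_link = INF, None
        for n, cost in link_cost.items():
            total = cost + neighbour_vectors[n].get(dest, INF)
            if total < best_cost:
                best_cost, best_link = total, n  # step 7: minimum cost wins
        vector[dest], routes[dest] = best_cost, best_link
    return vector, routes

vector, routes = recompute(["B", "C", "D"])
print(vector)  # {'B': 1, 'C': 3, 'D': 5}
print(routes)  # {'B': 'B', 'C': 'B', 'D': 'C'}
```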


2.3.2.7 Link State Routing



  1. Each switch knows addresses that are direct neighbours
  2. Switch constructs packets saying who are neighbours - link
    state packets
    .
  3. Link state packets flooded to all other switches
  4. Switch constructs complete graph using most recent link state
    packets from all other switches
  5. Use Dijkstra shortest path to figure out routing table.



Examples include OSPF.
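Step 5 can be sketched with Dijkstra's shortest-path algorithm over a graph assembled from link state packets; the graph below is invented for illustration:

```python
import heapq

# Graph reconstructed from everyone's link state packets:
# node -> {neighbour: link cost}.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}

def dijkstra(source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry, skip it
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd            # found a shorter path
                heapq.heappush(heap, (nd, nbr))
    return dist

print(dijkstra("A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```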

2.3.3 Conclusion: Network Properties


Packet Switched Networks present certain fundamental problems to the distributed systems programmer:

  • Until switches converge on a consistent set of routes, packets
    will bounce around in the network, and suffer delays or get dropped.
  • Switches are shared with other packets, so queues form.
  • Total latency is therefore variable.
  • Loss rate is variable (noise, available switch buffers).
  • Queue size management (aka congestion control) changes the bandwidth available to machines. Therefore bandwidth is variable.

2.8 Computer Security: Why you should never trust a
computer system




Goal: Prevent Misuse of computers

  • Definitions
  • Authentication
  • Private and Public Key Encryption
  • Access Control and Capabilities
  • Enforcement of security policies
  • Examples of Security Problems


2.8.1 Definitions




Types of Misuse

  • Accidental
  • Intentional



Protection is to prevent either accidental or intentional misuse


Security is to prevent intentional misuse


Three pieces to security

Authentication: Who the user is
Authorisation: Who is allowed to do what
Enforcement: Ensure that people only do what they are allowed to do

A loophole in any of these can cause problems eg

  1. Log in as super-user
  2. Log in as anyone, do anything
  3. Can you trust software to make decisions about 1 and 2?


2.8.2 Authentication


Common approach: Passwords. Shared secret between two parties. Since
only I know password, machine can assume it is me.


Problem 1: the system must keep a copy of the secret, to check against the
password. What if a malicious user gains access to this list of
passwords?


Encryption: a transformation on data that is difficult to
reverse - in particular, secure digest functions.

2.8.2.1 Secure Digest Functions


A secure digest function h = H(M) has the following properties:

  1. Given M, it is easy to compute h.
  2. Given h, it is hard to compute M.
  3. Given M, it is hard to compute M' ≠ M such that
    H(M) = H(M')



For example: Unix /etc/passwd file


Password
$ \rightarrow$ one way transform
$ \rightarrow$ encrypted
password


System stores only encrypted version, so ok if someone reads the file.
When you type in password, system encrypts password and compares
against the stored encrypted versions.


Over the years, password protection has evolved from DES, using a
well-known string as the input data, and the password as the key,
through to MD5 to SHA-1.
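A minimal sketch of one-way password storage using SHA-256 with a per-user random salt; the layout is illustrative, not any real /etc/passwd format:

```python
import hashlib
import os

# Store only (salt, digest); the password itself never hits disk.
def store(password: str):
    salt = os.urandom(8)                 # unique per user
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

# At login, re-derive the digest and compare against the stored one.
def check(password: str, salt: bytes, digest: str) -> bool:
    return hashlib.sha256(salt + password.encode()).hexdigest() == digest

salt, digest = store("correct horse")
print(check("correct horse", salt, digest))  # True
print(check("wrong guess", salt, digest))    # False
```

The salt is the "unique number" extension discussed below: two users with the same password get different digests, so an attacker cannot crack all accounts with one dictionary pass.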


2.8.2.2 Passwords as Human Factors Problem




Passwords must be long and obscure.


Paradox: short passwords are easy to crack, but people write long
ones down


Improving technology means we have to use longer passwords. Consider
that unix initially required only lowercase 5 letter passwords


How long for an exhaustive search?
26^5 ≈ 12,000,000

  • In 1975, 10 ms to check a password
    $ \rightarrow$ 1 day
  • In 2003, 0.00001 ms to check a password
    $ \rightarrow$ 0.1 second



Most people choose even simpler passwords such as English words - it takes
even less time to check for all words in a dictionary.


2.8.2.3 Some solutions





  1. Extend everyone's password with a unique number (stored in
    passwd file). Can't crack multiple passwords at a time.
  2. Require more complex passwords, eg 7 letters with lower, upper,
    number and special: 70^7 ≈ 8000 billion, or 1 day. But
    people pick common patterns eg 6 lower case plus number.
  3. Make it take a long time to check each password. For example,
    delay every login attempt by 1 second.
  4. Assign long passwords. Give everyone a calculator or smart card
    to carry around to remember password, with PIN to activate. Need
    physical theft to steal card.



Problem 3: Can you trust the encryption algorithm? Recent
example: techniques thought to be safe such as DES, have back doors
(intentional or unintentional). If there is a back door, you don't
need to do complete exhaustive search.


2.8.3 Authentication in distributed systems: Private Key Encryption




In distributed systems, the network between the machine on which the
password is typed and the machine the password is authenticating on is
accessible to everyone.


Two roles for encryption:

  1. Authentication - do we share the same secret?
  2. Secrecy - I don't want anyone to know this data (eg medical
    records)



Use an encryption algorithm that can easily be reversed given the
correct key, and difficult to reverse without the key


\includegraphics{figs/private}



  • From cipher text, can't decode without password.
  • From plain text and cipher text, can't derive password.
  • As long as the password stays secret, we get both secrecy and
    authentication.

2.8.3.1 Symmetric Encryption


Symmetric encryption uses exclusive ors, addition, multiplication,
shifts and transpositions, all of which are fast on modern processors.
DES: The Data Encryption Standard (DES) was released in 1977.
The 56 bit key is too weak for most uses now, and instead 3DES or
triple DES is used, which has a 168 bit key.
IDEA: The International Data Encryption Algorithm has 128 bit
keys, and has proven strong against a large body of analysis.
AES: The US-based NIST has defined a new encryption algorithm
based on Rijndael, which offers 128, 192 or 256 bit keys.


2.8.3.2 Authentication Server - Kerberos Operation




The server keeps a list of passwords, provides a way for two parties,
A, B to talk to each other, as long as they trust the server.


Notation: Kxy is a key for talking between x and y. (..)^K means
encrypt message (..) with key K.

Simplistic overview:

  • A asks server for key:

    A -> S (Hi! I'd like a key for A,B)

  • Server gives back special ``session'' key encrypted in A's key, along
    with ticket to give to B:

    S -> A (Use Kab (This is A! Use Kab)^Ksb)^Ksa

  • A gives B the ticket:

    A -> B (This is A! Use Kab)^Ksb


Details: add in timestamps to limit how long each key exists, use
single use authenticators so that clients are sure of server
identities, and to prevent machine replaying messages later.


Also have to include encrypted checksums to prevent malicious user
inserting garbage into message.


2.8.4 Public Key Encryption




Public key encryption is a much slower alternative to private key;
separates authentication from secrecy.


Each key is a pair K, K^-1


With private key: (text)^K^K = text


With public key:

  • (text)^K^K^-1 = text,
    but (text)^K^K ≠ text

  • (text)^K^-1^K = text,
    but (text)^K^-1^K^-1 ≠ text


Can't derive K from K^-1, or vice versa
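The key-pair property can be demonstrated with a toy RSA pair (the standard textbook example n = 3233 = 61 x 53, e = 17, d = 413; real keys are thousands of bits, so these values are purely illustrative):

```python
# Applying K then K^-1 (or K^-1 then K) restores the text; applying
# the same key twice does not.
n, e, d = 3233, 17, 413   # public exponent e, private exponent d

def apply_key(m, exp):
    return pow(m, exp, n)  # modular exponentiation

text = 65
signed = apply_key(text, d)          # apply private key K
print(apply_key(signed, e) == text)  # True: K^-1 undoes K (authentication)
secret = apply_key(text, e)          # apply public key K^-1
print(apply_key(secret, d) == text)  # True: K undoes K^-1 (secrecy)
```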

2.8.4.1 Public key directory




Idea: K is kept secret, K^-1 made public, such as in a public directory


For example: (I'm Ian)^K. Everyone can read it, but only I
could have sent it (authentication).


(Hi!)^K^-1. Anyone can send it but only I can read it
(secrecy).


((I'm Ian)^K Hi!)^K'^-1


On first glance, only I can send it, only you can read it.
What's wrong with this assumption?


Problem: How do you trust the dictionary of public keys? Maybe
somebody lied to you in giving you a key.

2.8.5 Secure Socket Layer





  • Provides a technique for data sent over a TCP connection to be
    encrypted.
  • Uses public key technology to agree upon a key, then 3DES or
    whatever to encrypt the session.
  • Data encrypted in blocks, optionally with compression first.
  • Used in http as https, and for telnet, ftp etc.

2.8.5.1 SSL handshake protocol




\includegraphics{figs/ssl}


2.8.6 Authorisation




Who can do what...


Access control matrix: formalisation of all the permissions in the
system



objects:   file1   file2   file3   ...
users:
  A        rw      r
  B                rw
  C                        r
  ...



For example, one box represents that C can read file3


Potentially huge number of users and objects, so impractical to store
all of these.


2.8.6.1 Approaches to authorisation





  1. Access control list - store all permissions for all users with
    each object


    Still might be a large number of users. Unix addresses this by having rwx
    for user, group and world. Recent systems provide a way of specifying
    groups of users and permissions for each group

  2. Capability list - each process stores tickets for the objects it
    can operate on.
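A minimal sketch of option 1, storing the sparse access control matrix as per-object access control lists; the users, files and permissions follow the earlier matrix example:

```python
# Only the non-empty boxes of the matrix are kept, one list per object.
acl = {
    "file1": {"A": {"r", "w"}},
    "file2": {"A": {"r"}, "B": {"r", "w"}},
    "file3": {"C": {"r"}},
}

def allowed(user, obj, op):
    return op in acl.get(obj, {}).get(user, set())

print(allowed("C", "file3", "r"))  # True: C can read file3
print(allowed("C", "file1", "w"))  # False: that box is empty
```

A capability list would invert this layout: each process carries its own set of (object, permission) tickets instead.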


2.8.6.2 Digital Rights Management





  • Digital content is produced by people earning a living, and they
    wish to protect their investment
  • Digital Rights Management is the catchall term for the
    technologies controlling the use of digital content.
  • Two main approaches:
    Containment: the content is wrapped in an encrypted
    shell, so that users have to prove their capability of
    using the content through knowledge of the key.

    Watermarking: the content is marked so that devices know
    that the data is protected.


  • The problem is how to enforce the checking of
    capabilities and enforce the no circumvention requirement.


2.8.7 Enforcement




Enforcer is the program that checks passwords, access control lists,
etc


Any bug in the enforcer means a way for a malicious user to gain the
ability to do anything


In Unix, superuser has all the powers of the Unix kernel - can do
anything. Because of coarse-grained access control, lots of stuff has
to run as superuser to work. If bug in any program, you're in
trouble.


Paradox:

  1. make enforcer as small as possible - easier to get correct, but
    simple minded protection model (more programs have high privilege).
  2. fancy protection - only minimal privilege, but hard to get
    right.

2.8.8 Trusted Computing Platform




Question: How do we ensure no software transgresses digital
rights?


Answer: By only allowing approved software to have access to
the data.



  • The basic computer and operating system must be running
    validated software.
  • There must be an untamperable part of the computer/OS that can
    hold keys and hand over only to validated software.
  • The untamperable part of the computer is the Fritz chip, a
    co-processor which holds a unique certificate that it is running
    some validated code

2.9 Names and Naming Services


2.9.1 Main Points





  • Use of names
  • Structure of names
  • Name Services
  • Domain Name Service (DNS) - The Internet Name Service



Definitions:

Name - what it's called
Address - where it is
Route - how to get there


2.9.2 Why names?




Object Identification: a service or resource we want to use, eg
a filename, a telecommunications provider, a person
Allow Sharing: Communicating processes can pass names and thus
share resources
Location Independence: If we separate the name from the address,
we can migrate the object transparently
Security: If there is a large number of possible names, knowing the name of
an object means that it must explicitly have been passed. If the
entire system is constructed so that names are passed with
authorisation, then knowing a name implies a chain of trust allowing
access to the object.


2.9.3 What does one do with names?



  • Use as arguments to functions, eg to call a service
  • Create names. If an object comes into existence it must be
    named. The name should meet the rules of the system, eg be unique; therefore a
    naming authority (possibly the object itself) must keep track of what names it
    has allocated
  • Delete names. When an object disappears from the system, at
    some point we may wish to delete its name and allow re-use.


2.9.4 What's a name?





  • Part of the design space for any distributed system is how to
    construct the name.
  • Choice of design has ramifications for design of rest of system
    and performance of service.


2.9.4.1 Unique Identifier





  • Unique Identifier (UID) aka flat names, primitive names.
  • No internal structure, just string of bits.
  • Only use is comparison against other UIDs eg for lookup in table
    containing information on named object.
  • Provide location independence and uniformity
  • But Difficult to name different versions of the same
    object eg Distributed systems edition 1 vs Distributed Systems
    edition 2. Real objects have relationships to each other - useful
    to reflect in naming practice
  • Difficult to discover address from name - need to search entire
    name space


2.9.5 Partitioned names


Add partition structure to name to enable some function to be
more efficient, typically location and name allocation


\includegraphics{figs/dns}



  • Domain Name Service (DNS) name - unix.rn.informatics.scitech.sussex.ac.uk.
    Each part of name comes from flat name space. Division of name space
    and possible objects into smaller space. "uk" reduces objects to
    those within the uk, "ac" to those administered by academic
    networks, and so on.



  • When allocating names, simply add to lowest part of partition.
    Low risk of collision, since small number of other objects, all
    administered by same authority



  • When looking up name, can guess where information will reside.


2.9.6 Descriptive names



  • Necessary to have a unique name.



  • But useful to name objects in different ways, such as by service
    they offer eg Postman Pat, John of Gwent, www.informatics.sussex.ac.uk.



  • Create name using attributes of object.



  • Note that objects can therefore have multiple names, not all of
    which are unique



  • Choice of name structure depends on system. DNS chooses
    partition according to administration of creation, with aliases to
    allow naming by service eg ftp.informatics.sussex.ac.uk is also
    doc-sun.rn.informatics.scitech.sussex.ac.uk


2.9.7 Object Location from name - broadcast



  • Ask all possible objects within the system if they respond to
    that name. Can be efficient if network supports broadcast eg
    Ethernet and other LANs.



  • Equivalent to distributing name table across all objects,
    objects storing names referring to them.



  • Only want positive responses - all responses would generate a
    lot of traffic.



  • Scaling problem when move into wide area.

    • Greater number of hosts imply greater probability of failure



    • Broadcasts consume higher proportion of bandwidth, made worse
      due to higher failures needing more location requests




  • Broadcast used only on small LAN based systems (and in initial
    location of directory)


2.9.8 Location through database



  • Keep information about object in a table, indexed by a name. Location
    is just another piece of information.



  • In DNS, table is stored as {name,attribute} pairing, in a resource record



  • If database centralised, then

    • Whole system fails if database machine fails
    • Database machine acts as bottleneck for performance of system as
      whole
    • In wide area systems, authority should be shared amongst
      controlling organisations



    So name information usually distributed.


2.9.9 Distributed Name Servers


Parts of the name table are distributed to different servers eg. in
Domain Name Service, servers are allocated portions of the name space
below certain domains such as the root, ac.uk, susx.ac.uk


Names can be partitioned between servers based on

algorithmic clustering: eg apply a well-known hash function to the name to map it to a server. May
result in the server being remote from the object. The only technique available for UIDs
structural clustering: if the name is structured, use the structure to assign names to a particular server, such as in DNS
attribute clustering: if names are constructed using attributes, then servers take responsibility for certain attributes.
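A minimal sketch of algorithmic clustering, hashing the name to pick the responsible server; the server list is invented:

```python
import hashlib

# Every client computes the same hash, so everyone agrees on which
# server holds a given name without any lookup traffic.
servers = ["ns0", "ns1", "ns2", "ns3"]

def server_for(name: str) -> str:
    digest = hashlib.sha256(name.encode()).digest()
    return servers[digest[0] % len(servers)]  # deterministic choice

print(server_for("www.cogs.susx.ac.uk"))
```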


2.9.10 Availability and performance


If a single server is responsible for name space, there is still single
point of failure, and a performance bottleneck.


Most systems therefore

  • Replicate name list to other servers. Also increases
    performance for heavily accessed parts of name space eg secondary
    servers in DNS
  • Cache information received by lookup. No need to repeat lookup
    if asked for same information again. Increases performance.
    Implemented in both client side and server (in recursive calls) in
    DNS.
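A minimal sketch of client-side caching of lookup results with a time-to-live, so stale entries eventually expire; the name, address and TTL value are illustrative:

```python
import time

TTL = 60.0                      # seconds before a cached entry expires
cache = {}                      # name -> (address, time stored)

def lookup(name, resolve):
    entry = cache.get(name)
    if entry and time.time() - entry[1] < TTL:
        return entry[0]                  # fresh cached answer, no network
    address = resolve(name)              # otherwise ask the name server
    cache[name] = (address, time.time())
    return address

calls = []
def resolve(name):                       # stand-in for a real DNS query
    calls.append(name)
    return "139.184.0.1"                 # made-up address

lookup("www.susx.ac.uk", resolve)
lookup("www.susx.ac.uk", resolve)        # second call served from cache
print(len(calls))                        # 1: only one real lookup happened
```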



If information is cached, how do we know when it's invalid? We may
attempt to use inconsistent information.


2.9.11 Maintaining consistency for distributed name services


Alleviated by following features of some distributed systems

  • In most systems objects change slowly, so names live for a long
    time, and are created infrequently
  • If the address of an object is wrong, it causes an error. The user of the
    address can recover if it assumes that inconsistent information is one of
    the possible problems.
  • Obsolete information can be fixed by the addressed object leaving
    redirection pointers. Equivalent to leaving a forwarding address for a new home.



However, there are always systems which break these assumptions eg
highly dynamic distributed object system, creating lots of objects and
names and deleting lots of names.


2.9.12 Client and Name server interaction.



  • Hidden behind RPC interface.



  • Client knows of server to ask (either installed in file, or
    through broadcast location).



  • Client calls with arguments of name and required attribute, eg
    address. In DNS arguments are name and the type of requested
    attribute.



  • Server will return result with either required attribute or
    error message


2.9.12.1 Lookup Modes




If the name is not stored on this server, it may be stored on another server. Two options:

Recursive: the server asks another possible server about the name and
attribute, which may then have to ask another server and so on


\includegraphics{figs/recursive}



Iterative: the server returns the address of another possible server to the client, who then resends the request to the new server.


\includegraphics{figs/iterative}


2.9.13 Summary



  • Names can be flat, or they can be structured



  • Centralised name servers suffer from availability - distributed
    name servers suffer from inconsistency



  • Interaction with name server best modelled by RPC

2.10 Distributed File Systems


2.10.1 Main Points


A Distributed File System provides transparent access to files
stored on a remote disk


Themes:

  • Failures - what happens when server crashes but client doesn't?
    or vice versa
  • Performance
    $ \Rightarrow$ Caching - use caching at both the
    clients and the server to improve performance
  • Cache Coherence - how do we make sure each client sees most up
    to date copy of file?


2.10.2 Client Implementation



  • Request for operation on file goes to OS.
  • OS recognises file is remote and constructs RPC to remote file server
  • Server receives call, does operation and returns results.
  • Subtrees in the directory structure generally map to remote file systems.
    Provides access transparency


    \includegraphics{figs/remote-dir}


2.10.3 No Caching



  • Simple approach: Use RPC to forward each file system request to remote
    server (older versions of Novell Netware).
  • Example operations: open, seek, read, write, close
  • Server implements operations as it would for local request and passes
    result back to client


    Image simple-fs


2.10.3.1 Advantages and Disadvantages of uncached remote file
service


Advantage: the server provides a consistent view of the file system to both A
and B.
Disadvantage: can be lousy performance

  • Going over network is slower than going through memory
  • Lots of network traffic - congestion
  • Server can be a bottleneck - what if lots of clients


2.10.4 NFS - Sun Network File System


Main idea - uses caching to reduce network load

  • Cache file blocks, file headers etc in both the client's and the
    server's memory.
  • More recent NFS implementations use a disk cache at the client
    as well.



\includegraphics{figs/nfs-fs}

Advantage: if open/read/write/close can be done locally, there is no network
traffic

Issues: failure and cache consistency


2.10.4.1 Failures


What if server crashes? Can client wait until server comes back up
and continue as before?

  1. Any data in server memory but not yet on disk can be lost
  2. If there is shared state across RPCs, eg open, seek, read: what
    if the server crashes after the seek? Then when the client does
    ``read'' it will fail.
  3. Message re-tries - suppose server crashes after it does ``rm
    foo'' but before it sends acknowledgement? Message system will
    retry and send it again. How does system know not to delete it
    again (someone else may have created new ``foo''). Could use
    transactions, but NFS takes more ad hoc approach.

What if client crashes?

  1. Might lose modified data in client cache.


2.10.4.2 NFS Protocol



  1. Write-through caching - when a file is modified, all modified
    blocks are sent immediately to the server disk. To the client,
    ``write'' doesn't return until all bytes are stored on disk.
  2. Stateless Protocol - server keeps no state about client, except
    as hints to improve performance

    • Each read request gives enough information to do entire operation -
      ReadAt(inumber, position) not Read(openFile).
    • When server crashes and restarts, can start processing requests
      immediately, as if nothing had happened.


  3. Operations are idempotent: all requests are ok to repeat
    (all requests are done at least once). So if server crashes between
    disk I/O and message send, client can resend message and server
    does request over again

    • Read and write file blocks are easy - just re-read or re-write file
      block, so no side-effects.
    • What about ``remove''? NFS just ignores the problem - it does the
      remove twice, returning an error if the file is not found; the
      application breaks if the retry inadvertently removes another file.


  4. Failures are transparent to client system.


    Is this a good idea? What should happen if the server crashes? If an
    application is in the middle of reading a file when the server
    crashes, the options are:

    1. Hang until server returns (next week...)
    2. Return an error? But the networked file service is transparent -
      the application doesn't know the network is there. Many Unix apps
      ignore errors and crash if there is a problem.

    NFS has both options - the administrator can select which one.
    Usually hang and return error if absolutely necessary.
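The stateless, idempotent style of request in points 1-3 can be sketched as below. The file table, inumber and function names are invented for illustration; the point is that every request carries enough information (file id plus offset) to stand alone, so a retried request is harmless.

```python
# Toy "server-side" file table: inumber -> file contents.
FILES = {42: b'hello distributed world'}

def read_at(inumber, position, count):
    """Stateless read in the NFS style: no open-file table on the
    server, so repeating the request has no side effects."""
    data = FILES[inumber]
    return data[position:position + count]

# Retrying the same request returns the same bytes - idempotent.
assert read_at(42, 6, 11) == read_at(42, 6, 11) == b'distributed'
```

Compare `Read(openFile)`, which would need the server to remember a per-client file position - exactly the shared state that is lost in a crash.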


2.10.4.3 Cache consistency



  • What if multiple clients sharing same files? Easy if they are both
    reading - each gets a local copy in their cache.
  • What if one writing? How do updates happen?
  • Note NFS has write-through cache policy. If one client modifies file,
    writes through to server.
  • How does other client find out about change?


2.10.4.4 NFS and weak consistency



  • In NFS, the client polls the server periodically to check whether a
    file has changed. It polls if the cached data hasn't been checked in
    the last 3-30 seconds (the exact time is a tunable parameter)


    \includegraphics{figs/nfs-cache}



  • When file changed on one client, server is notified, but other clients
    use old version of file till timeout. They then check server and get
    new version.
  • What if multiple clients write to the same file? In NFS you get
    either version, or a mixed version. Completely arbitrary!
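The polling scheme above can be sketched as follows. The `Server` and `CachingClient` classes are invented for illustration; a cached copy is trusted until the timeout expires, so reads can be stale for up to that long.

```python
import time

class Server:
    """Toy file server: one file, with a modification time."""
    def __init__(self):
        self.data, self.mtime = 'v1', 0.0
    def write(self, data):
        self.data, self.mtime = data, time.time()

class CachingClient:
    """Client that re-checks the server only after `timeout` seconds."""
    def __init__(self, server, timeout=3.0):
        self.server, self.timeout = server, timeout
        self.cached = self.cached_mtime = self.checked = None
    def read(self):
        now = time.time()
        if self.cached is None or now - self.checked > self.timeout:
            # Poll: refetch only if the server's copy is newer than ours.
            if self.cached_mtime != self.server.mtime:
                self.cached = self.server.data
                self.cached_mtime = self.server.mtime
            self.checked = now
        return self.cached   # may be stale for up to `timeout` seconds

srv = Server()
client = CachingClient(srv)
print(client.read())   # 'v1'
```

If another client now writes `'v2'` to the server, this client keeps returning `'v1'` until its timeout expires - exactly the window of inconsistency described above.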


2.10.4.5 Sequential ordering constraints



  • What should happen if one CPU changes a file and, before it
    completes, another CPU reads the file?
  • We want isolation between operations, so a read
    should get the old file if it completes before the write starts, and the
    new version if it starts after the write completes. Either all new or
    all old - nothing in between (cf serialisability).


2.10.4.6 NFS Pros and Cons





  • it's simple
  • it's highly portable
  • but it's sometimes inconsistent
  • and it doesn't scale to large numbers of clients



Note that this describes NFS version 3.


2.10.5 Andrew File System


AFS (CMU late 80s)
$ \Rightarrow$ DCE DFS (commercial product)

  1. Callbacks: Server records who has copies of file
  2. Write through on close

    • If file changes, server is updated (on close)
    • Server then immediately tells all those with old copy



2.10.5.1 AFS Session Semantics


Session semantics - updates only visible on close

  • In Unix (single machine), updates visible immediately to other
    programs who have file open.
  • In AFS, everyone who has the file open sees the old version; anyone
    who opens the file again will see the new version.


    In AFS:

    1. on open and cache miss: get file from server, set up callback
    2. on write close: send copy to server; tells all clients with
      copies to fetch new version from server on next open

  • Files cached on local disk; NFS (used to) cache only in memory



\includegraphics{figs/AFS-cache}

  • What if server crashes? Lose all your callback state.
  • Reconstruct information from client - go ask everyone ``who has which
    files cached''
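The callback scheme can be sketched as below. The `AFSServer`/`AFSClient` classes and their method names are invented for illustration; the essential state is the server's record of who holds a cached copy of each file.

```python
class AFSServer:
    def __init__(self):
        self.files = {}       # name -> contents
        self.callbacks = {}   # name -> set of clients with a cached copy

    def fetch(self, name, client):
        # Cache miss at the client: record a callback promise for it.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store_on_close(self, name, data, writer):
        self.files[name] = data
        # Break callbacks: tell everyone else their copy is stale.
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)
        self.callbacks[name] = {writer}

class AFSClient:
    def __init__(self, server):
        self.server, self.cache = server, {}
    def open(self, name):
        if name not in self.cache:            # miss: fetch + set callback
            self.cache[name] = self.server.fetch(name, self)
        return self.cache[name]
    def write_close(self, name, data):
        self.cache[name] = data
        self.server.store_on_close(name, data, self)
    def break_callback(self, name):
        self.cache.pop(name, None)            # next open refetches
```

A client whose callback has been broken simply refetches on its next open, giving the session semantics described above.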


2.10.5.2 AFS Pros and Cons





  • Disk as cache
    $ \Rightarrow$ more files cached locally
  • Callbacks
    $ \Rightarrow$ server not involved if file is
    read-only (Majority of file access is read-only)
  • But on fast LANs, local disk is slower than remote memory



NFS version 4 will provide session semantics when it is
deployed by vendors in the 2005 timeframe.


2.10.6 Summary





  • Remote file systems need caching to get decent performance.



  • Central server is a bottleneck

    • Performance bottleneck:

      • All data is written through to server
      • all cache misses go to server


    • Availability Bottleneck

      • Server is single point of failure


    • Cost bottleneck

      • Server machines high cost relative to workstation



2.12 Content Distribution Networks


Main Points

  • Building content caches
  • Pre-fetching data
  • Using your neighbours - BitTorrent

2.12.1 Getting Content over a Network




Image cdn-base



  • Users want to download content from servers as quickly as possible
  • What structures can we use to improve their experience, and the
    impact upon the network?

2.12.2 Web Caches


Image cdn-cache

  • Large organisations can improve web performance by sticking a
    web cache in front of HTTP connections
  • The cache inspects an incoming HTTP request to see if it can
    satisfy the request from locally cached objects.
  • If yes, then the object is returned, and the last reference time
    is updated
  • If no, then the object is retrieved and copied locally if allowed.
  • Web objects can be marked as non-cacheable.
  • Caching is a major part of the HTTP standard
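The cache's decision procedure can be sketched as below. The origin is modelled as a dict, and all names (`ORIGIN`, `get`, the URLs, the cacheable flag) are invented for illustration; real caches follow the HTTP caching rules.

```python
import time

# Origin content: url -> (body, cacheable?). '/private' is marked
# non-cacheable, as a site tracking its readers might do.
ORIGIN = {'/index.html': ('<html>home</html>', True),
          '/private': ('secret', False)}

cache = {}   # url -> (body, last_reference_time)

def get(url):
    if url in cache:
        body, _ = cache[url]
        cache[url] = (body, time.time())   # hit: update last reference
        return body, 'HIT'
    body, cacheable = ORIGIN[url]          # miss: fetch from the origin
    if cacheable:
        cache[url] = (body, time.time())   # keep a local copy if allowed
    return body, 'MISS'
```

The last-reference time would feed an eviction policy such as LRU, not shown here.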

2.12.2.1 Cache Performance




  • The performance of a web cache is difficult to model, since the
    performance is a complex mixture of interaction between TCP, HTTP
    and content.
  • Caches work because of temporal locality, due to popularity of
    content, and spatial locality, due to structure of HTML documents
  • Measurements of web proxies give the following results (based on
    JANET caches)

    • Request hit rate is about 60%.
    • Volume hit rate is about 30%.
    • Latency improvement is around a factor of 3 on average



2.12.2.2 Problems with Caching



  • Not all content is marked as cacheable, eg because the site
    wants accurate records of who looks at content.
  • All hosts behind a cache appear to come from one
    address.

    Question: Why is this a problem?

2.12.3 Pre-fetching Data



  • Can we improve performance by pro-actively distributing content
    to caches?
  • Yes...

2.12.3.1 Active Content Distribution



  • The HTML uses links to the CDN's domain name, eg akamai.com
  • The internet service provider has entries in their local DNS for
    akamai.com pointing to a machine on the ISP's network.
  • Content will therefore be supplied from the local machine rather
    than the original machine
  • Customer interaction improved.
  • Bandwidth requirements of servers reduced

2.12.4 Using your Peers: BitTorrent


Image cdn-torrent

  • Why not use the other people receiving the content?
  • BitTorrent downloads from other machines
  • Basic Idea:

    • To download, the host contacts a machine tracking those already
      downloading the torrent.
    • Peers are selected at random from the tracker.
    • Pieces are selected to download from those available on the
      downloader's peer set until all pieces of the file have been received

2.12.4.1 The .torrent file



  • To start downloading from a torrent, the consumer must
    first locate a .torrent file.
  • This contains information about the file length, name and
    hashing numbers of the file blocks, and the url of a tracker
  • The file is split into 256 KByte pieces, each having a SHA1 hash
    calculated.
  • A tracker holds the IP addresses of current downloaders

2.12.4.2 Locating peers



  • After receiving the torrent file, the downloader contacts the
    tracker
  • The tracker inserts the downloader into its list of downloaders,
    and returns a random list of other downloaders
  • This list becomes the downloader's peers
  • Question: What is the shape of the overlay graph?

2.12.4.3 Choosing Pieces



  • The downloader will contact its peers to discover what pieces
    they have.
  • It then chooses a piece to download:
  • The first choice is made randomly, so as to spread load
  • Subsequent pieces are based on a rarest-first approach to
    increase probability all pieces are available.
  • When a peer has downloaded a new piece which matches the SHA1
    hash, it notifies its peers it has a new piece.
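The piece-selection policy can be sketched as follows. The data shapes (a set of held pieces, one set of pieces per peer) are invented for illustration.

```python
import random
from collections import Counter

def choose_piece(have, peer_pieces, first=False):
    """have: pieces we already hold; peer_pieces: one set per peer."""
    wanted = [p for pieces in peer_pieces for p in pieces
              if p not in have]
    if not wanted:
        return None
    if first:
        # First piece: picked at random, to spread load.
        return random.choice(sorted(set(wanted)))
    # Count replication within the peer set and pick the rarest piece,
    # to increase the probability that every piece stays available.
    counts = Counter(wanted)
    return min(counts, key=lambda p: counts[p])

# Piece 2 is held by only one peer, so it is chosen before 0 and 1.
peers = [{0, 1}, {0, 1, 2}, {0, 1}]
print(choose_piece(set(), peers))   # 2
```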

2.12.4.4 Choosing Downloaders



  • A request to upload is accepted if the requester recently
    uploaded to it
  • This provides an incentive to machines to share data
  • Periodically other machines are tried for upload

2.12.5 Summary



  • Web caching improves performance by a reasonable factor,
    dependent on situation
  • Pro-active content distribution can reduce latency and improve
    bandwidth usage for popular services
  • BitTorrent can improve bandwidth usage by spreading load across
    peers.

2.13 Replication: Availability and Consistency



  • Motivation for replication
  • Multicasting updates to a group of replicas
  • Total Ordering
  • Causal Ordering
  • Techniques for ordering protocols
  • ISIS CBCAST


2.13.1 What is Replication?




  • Multiple copies of dynamic state stored on multiple machines
    eg Copies of files stored on different machines, name servers storing
    name address mappings

  • Caching can be seen as a form of replication.


2.13.1.1 Why is Replication used?


Performance enhancement



  • Single Server acts as a bottleneck - if we can balance load
    amongst multiple servers, get apparent performance gain
  • If clients are geographically distributed, we can site servers
    near clients and reduce communication costs


Availability



  • If a machine fails, then we can still provide a service
  • Probability of total failure, such as all data being lost, is
    reduced, since data is replicated across multiple machines
  • If the probability of failure is pr(fail ) for a given machine, then
    with n machines the probability of loss of service is
    pr(fail )^n and the availability of the service is
    1 - pr(fail )^n
  • eg, if the mean time between failures for 3 machines is 5 days and
    the repair time is four hours, then assuming independence of failures,

    pr(fail )= $ {\frac{{4}}{{5 \times 24}}}$ = 0.033.


    Availability =
    1 - 0.033^3 = 99.996%
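The figures above can be checked directly, assuming independent failures: each machine is down for its repair time out of every mean-time-between-failures period.

```python
repair_hours = 4
mtbf_hours = 5 * 24   # 5 days
n = 3                 # number of replicas

p_fail = repair_hours / mtbf_hours   # ~0.033 per machine
availability = 1 - p_fail ** n       # service lost only if all n are down
print(f'{availability:.3%}')         # 99.996%
```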





Fault tolerance: even in the presence of failure, the service
will continue to give the correct service

  • Stronger than availability, since can provide real-time
    guarantees (with extra work!)
  • Can protect against arbitrary failure where machines feed
    wrong information (Byzantine Failure)


2.13.2 Issues in Replication


A collection of replicas should behave as if state was stored
at one single site

  • When accessed by client, view should be consistent
  • Replication should be transparent - client unaware that servers
    are replicated



If we are providing a replica service, replica can be passive or
active.



Passive replicas are standbys, to maintain service on failure. No
performance improvement.


Standbys must monitor and copy state of active server


Provide availability in simple manner.


Used for highly available systems eg space applications


2.13.3 Consistency





  • Clients can modify resource on any of the replicas.
  • What happens if another client requests resource before replica
    has informed others of modification, as in cache consistency in
    distributed file systems?
  • Answer depends upon application...


2.13.3.1 Example Distributed Bulletin Board System (BBS)


\includegraphics{figs/bbs}

  • Users read and submit articles through Front End.
  • Articles replicated across a number of servers
  • Front Ends can connect to any server
  • Servers propagate articles between themselves so that all
    servers hold copies of all articles.
  • User membership of a given bbs is tightly controlled.



Questions on BBS:

  • How should messages be passed between replicas?
  • Should order of presentation of articles to clients be the same
    across all replicas? Are weaker ordering semantics possible?
  • When a client leaves a bbs group, can they see articles submitted
    after they have left? Is this desirable?
  • What should happen when replicas are temporarily partitioned?


2.13.4 Updating Server state




Clients read and update state at any of the replicated servers eg
submit messages in bbs. To
maintain consistency, three things are important

Multicast communication: messages are delivered to all servers in
the group replicating the data
Ordering of messages: updates occur in the same ``order'' at
each server
Failure recovery: when servers or the network fail and come
back, the replicas must be able to regain consistency. Done through
voting and transactions (later in course)


2.13.5 Multicast and Process Groups




A Process Group: a collection of processes that co-operate
towards a common goal.


Multicast communication: One message is sent to the members of a
process group


Idea: Instead of knowing address of process, just need to know an
address representing the service. Lower levels take care of routing
messages.


Useful for:

Replicated services: one update message goes to all replicas,
which perform identical operations. Reduces communication costs.
Locating objects in distributed services: a request for an object
goes to all processes implementing the service, but only the process
holding the object replies.


2.13.5.1 Group Services




Maintenance of group information is a complex function of the name
service (for tightly managed groups)

Create group: create a group identifier that is globally
unique.
Join group: join a group. Requires the joining process's information
to be disseminated to the message routing function. May require
authentication and notification of existing members.
Leave group: remove a process from a group. May require
authentication, and may occur as a result of failure or partition. Need
to notify the message routing function, and may notify other members.
Member list: supply the list of processes within a group.
Needed for reliable message delivery; may require authentication.


2.13.6 Message Ordering




If two processes multicast to a group, the messages may be arbitrarily
ordered at any member of the group.


Process P1 multicasts message a to a group comprising processes P1, P2, P3 and
P4.


Process P2 multicasts message b to the same group


The order of arrival of a and b at members of the group can be different.


\includegraphics{figs/order}


2.13.6.1 Ordering example




\includegraphics{figs/causal-object}

  • Order of operations may be important - delete object, create
    object.
  • If delete object arrives before create object, then operation
    not completed


2.13.6.2 Ordering Definitions




Various definitions of order with increasing complexity in
multicasting protocol

FIFO ordering: messages from one process are processed at all
group members in the same order
Causal ordering: all events which preceded the message
transmission at a process precede the message reception at other
processes. Events are message receptions and transmissions.
Total ordering: messages are processed at each group member in
the same order.
Sync ordering: for a sync-ordered message, either an event
occurred before message reception at all processes, or after it.
Other events may be causally or totally ordered.


2.13.6.3 FIFO ordering




Achieved by process adding a sequence number to each message.


Group member orders incoming messages with respect to sequence number.


Applicable when each process state is separate, or operations don't
modify state, just add incremental updates or read.
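The sequence-number scheme can be sketched as a receiver with a per-sender holdback queue. The class and names are invented for illustration.

```python
class FifoReceiver:
    def __init__(self):
        self.expected = {}   # sender -> next sequence number expected
        self.holdback = {}   # sender -> {seq: message} awaiting delivery
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.holdback.setdefault(sender, {})[seq] = msg
        nxt = self.expected.get(sender, 0)
        # Deliver every consecutive message now available, in order.
        while nxt in self.holdback[sender]:
            self.delivered.append(self.holdback[sender].pop(nxt))
            nxt += 1
        self.expected[sender] = nxt

r = FifoReceiver()
r.receive('p1', 1, 'b')   # arrives early: held back
r.receive('p1', 0, 'a')   # releases both, in sender order
print(r.delivered)        # ['a', 'b']
```

Note this orders messages per sender only; messages from different senders remain arbitrarily interleaved, which is exactly FIFO ordering's limit.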


2.13.6.4 Total Ordering




When several messages are sent to a group, all members of the group
receive the messages in the same order.


Two techniques for implementation:

Sequencer: elect a special sequencing node. All messages are
sent to the sequencer, who then sends messages on to the replicas. FIFO
ordering from the sequencer guarantees total ordering. Suffers from a
single point of failure (recoverable by election) and bottleneck.

Holdback queue: received messages are not passed to the
application immediately, but are held in a holdback queue
until the ordering constraints are met.


2.13.6.4.1 Sequence Number Negotiation

The sender negotiates the message's final sequence number with all
replicas.

  1. Replicas store largest final sequence number yet seen Fmax,
    and largest proposed sequence number Pmax
  2. Sender sends all replicas message with temporary ID.
  3. Each Replica i replies with suggested sequence number of
    max(Fmax, Pmax) + 1. Suggested sequence number
    provisionally assigned to message and message placed in holdback
    queue (ordered with smallest sequence number at front)
  4. Sending site chooses largest sequence number and notifies
    replicas of final sequence number. Replicas replace provisional
    sequence number with final sequence number.
  5. When item at front of queue has an agreed final sequence number,
    deliver the message.
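Steps 1-4 above can be sketched for a single message as follows; the `Replica` class and message plumbing are invented for illustration (the holdback queue and delivery step are omitted).

```python
class Replica:
    def __init__(self):
        self.fmax = 0   # largest final sequence number seen
        self.pmax = 0   # largest sequence number proposed

    def propose(self):
        # Step 3: suggest max(Fmax, Pmax) + 1 as a provisional number.
        self.pmax = max(self.fmax, self.pmax) + 1
        return self.pmax

    def finalise(self, seq):
        self.fmax = max(self.fmax, seq)

def total_order_send(replicas):
    proposals = [r.propose() for r in replicas]   # step 3
    final = max(proposals)                        # step 4: pick largest
    for r in replicas:
        r.finalise(final)                         # replace provisional
    return final

replicas = [Replica() for _ in range(3)]
print(total_order_send(replicas))   # 1
print(total_order_send(replicas))   # 2
```

Because every replica adopts the same final number, every replica's holdback queue would release messages in the same order, giving total ordering.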



2.13.6.5 Causal Ordering




``Cause'' here means potential cause: since we don't know the
application, any message might be causally related to an earlier one,
so that ordering must be preserved.


a and b are events, generally sending and receiving of
messages.


We define the causal relation,
a $ \rightarrow$ b, if

  1. if a and b are events at the same process and a happened
    before b, then a $ \rightarrow$ b
  2. if a is a message sent by process P1 and b is
    the arrival of the same message at P2, then
    a $ \rightarrow$ b is true



In a bulletin board, an article titled ``re: Multicast Routing'' in
response to an article called ``Multicast Routing'' should always come
after it, even though it may be received before the initial article


2.13.6.6 CBCAST - Causal ordering in ISIS


ISIS is a real commercial distributed system, based on process groups.


Causal ordering for multicast within a group is based around
Vector Timestamps


The vector VT has an entry for each member of the
group, typically an integer.


Vector timestamps have one operation defined



merge(u, v)[k] = max(u[k], v[k]), for k = 1..n


Incoming messages are placed on a holdback queue, until all messages
which causally precede the message have been delivered.



2.13.6.6.1 CBCAST Implementation





  1. All processes pi initialise the vector to zero
  2. When pi multicasts a new message, it first increments VTi[i] by 1; it piggybacks
    vt = VTi on the message



  3. Messages are delivered to the application in process Pj when

    • The message is the next in sequence from pi i.e.
      vt[i] = VTj[i] + 1
    • All causally prior messages that have been delivered to pi must have been delivered to pj, i.e.
      VTj[k] $ \geq$ vt[k] for k $ \neq$ i.


  4. When a message bearing a timestamp vt is delivered to pj, pj's timestamp is updated as
    VTj = merge(vt, VTj)



In words

  • Incoming vector timestamp is compared to current timestamp.
  • If conditions for delivery to process not met, then message placed on
    holdback queue.
  • When an incoming message is delivered, the timestamp is updated by the
    merge.
  • Examine all messages in the holdback queue to see if they can be
    delivered.
  • CBCAST requires reliable delivery.
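The delivery test and merge above can be sketched directly. Vectors are plain Python lists indexed by process number 0..n-1; the example scenario is invented for illustration.

```python
def merge(u, v):
    """Pointwise maximum of two vector timestamps."""
    return [max(a, b) for a, b in zip(u, v)]

def deliverable(vt, VT_j, i):
    """Message from sender i, timestamp vt, at a process with clock VT_j:
    deliverable iff it is next in sequence from i and everything the
    sender had seen has already been delivered here."""
    return vt[i] == VT_j[i] + 1 and all(
        vt[k] <= VT_j[k] for k in range(len(vt)) if k != i)

# A third process has seen nothing; p0's second message must wait for
# its first, even if it arrives earlier.
VT = [0, 0, 0]
m1, m2 = [1, 0, 0], [2, 0, 0]        # first and second messages from p0
assert not deliverable(m2, VT, 0)    # held back on the holdback queue
assert deliverable(m1, VT, 0)
VT = merge(m1, VT)                   # deliver m1, update the timestamp
assert deliverable(m2, VT, 0)        # now m2 can be delivered
```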


2.13.6.6.2 Causal Example


\includegraphics{figs/causal}


2.13.6.6.3 Group View Changes



  • When group membership changes, what set of messages should be
    delivered to members of changed group?
  • What happens to undelivered messages of failed members?
  • What messages should new member get?
  • ISIS solves by sending a sync ordered message announcing
    that the group view has changed. Messages thus belong to a
    particular group view.
  • Use coordinator to decide which messages belong to which view.


2.13.7 Summary



  • Replication of services and state increase availability
  • Replication increases performance
  • Replication increases Fault tolerance
  • To maintain consistency, multicast updates to all replicas
  • Use sequence numbers to maintain FIFO ordering
  • Use Vector Timestamps to maintain Causal Ordering
  • Use elected sequencers or identifier negotiation to maintain
    total ordering



2.14 Shared Data and Transactions





  • Stateful Servers
  • Atomicity
  • Transactions
  • ACID
  • Serial Equivalence


2.14.1 Servers and their state





  • Servers manage a resource, such as a database or a printer
  • Attempt to limit problems of distributed access by making server
    stateless, such that each request is independent of other
    requests.

    • Servers can crash in between servicing clients
    • Client requests cannot interfere with each other (assuming
      concurrency control in the server)


  • But we can't always design stateless servers...


2.14.1.1 Stateful Servers



  • Some applications better modelled as extended
    conversations, eg retrieving a list of records in a
    large database better modelled as getting batch of records at a
    time.
  • If application requires state to be consistent across a number
    of machines, then each machine must recognise when it can update
    internal data. Needs to keep track of state of distributed
    conversation
  • If conversations are long-lived, then the server needs to be aware of state.
  • If other conversations need to go on - eg modify records during
    retrieval, how do we allow concurrency?
  • What happens if machine fails - need to recover.
  • Should also aim to be fault tolerant


2.14.2 Atomicity


Stateful servers have two requirements

  1. Accesses from different clients shouldn't interfere
    with each other
  2. Clients should get fast access to the server


2.14.2.1 Definition


We define atomicity as
All or nothing: a client's operation on a server's resource
should complete successfully, and the results hold thereafter (yea,
even unto a server crash), or it should fail and the resource should
show no effect of the failed operation
Isolation: each operation should proceed without interference
from other clients' operations - intermediate effects should not be
visible.


2.14.2.2 Example


Mutual exclusion: for a multi-threaded server, if two or more
threads attempt to modify the same piece of data, then the updates
should be surrounded by mutual exclusion to provide isolation, using
semaphores or monitors
Synchronisation: in situations such as producer-consumer, we need
to allow one operation to finish so a second operation can use its
results, which also needs isolation.


2.14.3 Automatic Teller Machines and Bank accounts




\includegraphics{figs/cashmachine}



  • An ATM or cashmachine allows transfer of funds between accounts.
  • Accounts are held at various machines belonging to different
    banks
  • Accounts offer the following operations:
    deposit: place an amount of money in an account

    withdraw: take an amount of money from an account

    balance: get the current value in an account


  • Operations implemented as read() and write() of values, so
    withdraw x from A and deposit x in B implemented as

    1. A.write( A.read()  - x)
    2. B.write( B.read() + x)




2.14.4 Transactions


Transactions are technique for grouping operations on data so that
either all complete or none complete


Typically server offers transaction service, such as:

beginTransaction(transId): record the start of a transaction and
associate subsequent operations carrying this transId with it.
commitTransaction(transId): commit all the changes the
operations in this transaction have made to permanent storage.
abortTransaction(transId): abort all the changes the transaction's
operations have made, and roll back to the previous state.



2.14.4.1 ACID




Transactions are described by the ACID mnemonic

Atomicity: either all or none of the transaction's operations
are performed. If a transaction is interrupted by failure, then
partial changes are undone
Consistency: the system moves from one self-consistent state to another
Isolation: an incomplete transaction never reveals partial state
or changes before committing
Durability: after committing, the system never loses the results
of the transaction, independent of any subsequent failure

2.14.4.2 Concurrency Problems

2.14.4.2.1 Lost Update



Transaction T: A.withdraw(4,T); B.deposit(4,T)
Transaction U: C.withdraw(3,U); B.deposit(3,U)

Transaction T                      Transaction U
balance = A.read()          £100
A.write(balance - 4)         £96
                                   balance = C.read()          £300
                                   C.write(balance - 3)        £297
balance = B.read()          £200
                                   balance = B.read()          £200
                                   B.write(balance + 3)        £203
B.write(balance + 4)        £204

T read B before U wrote it, so T's final write overwrites U's: B ends
at £204 rather than the £207 a serial execution would give - U's
deposit of £3 is lost.




2.14.4.2.2 Inconsistent Retrievals



Transaction T: A.withdraw(100,T); B.deposit(100,T)
Transaction U: Bank.total(U)

Transaction T                      Transaction U
balance = A.read()          £200
A.write(balance - 100)      £100
                                   total = A.read()            £100
                                   total = total + B.read()    £300
                                   total = total + C.read()    £600
balance = B.read()          £200
B.write(balance + 100)      £300

U's total of £600 misses the £100 in flight from A to B: with A = £200,
B = £200 and C = £300, a serial execution would report £700.




2.14.5 Serial Equivalence


Definition: Two transactions are serial if all the operations in
one transaction precede the operations in the other.


eg the following actions are serial



Ri(x) Wi(x) Ri(y) Rj(x) Wj(y)


Definition: Two operations are in conflict if:

  • At least one is a write
  • They both act on the same data
  • They are issued by different transactions




Ri(x) Rj(x) Wi(x) Wj(y) Ri(y) has the pair
Rj(x), Wi(x) in conflict


Definition: Two schedules are computationally equivalent if:

  • The same operations are involved (possibly reordered)
  • For every pair of operations in conflict, the same operation
    appears first in each schedule

So, a schedule is serialisable if the schedule is computationally
equivalent to a serial schedule.
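For two transactions these definitions can be checked mechanically: collect the conflicting pairs, then test that the same transaction comes first in every conflict. This sketch represents operations as invented (transaction, kind, item) tuples; note the pairwise test below is only sufficient for two transactions (with more, a cycle check on the conflict graph is needed).

```python
def conflicts(schedule):
    """Yield (earlier_trans, later_trans) for each conflicting pair:
    different transactions, same item, at least one write."""
    for a in range(len(schedule)):
        for b in range(a + 1, len(schedule)):
            t1, k1, x1 = schedule[a]
            t2, k2, x2 = schedule[b]
            if t1 != t2 and x1 == x2 and 'W' in (k1, k2):
                yield t1, t2

def serialisable(schedule):
    order = set(conflicts(schedule))
    # Serialisable iff no pair of transactions conflicts both ways.
    return not any((t2, t1) in order for t1, t2 in order)

# The earlier example: Ri(x) Rj(x) Wi(x) Wj(y) Ri(y).
s = [('i', 'R', 'x'), ('j', 'R', 'x'), ('i', 'W', 'x'),
     ('j', 'W', 'y'), ('i', 'R', 'y')]
print(serialisable(s))   # True: every conflict puts j before i
```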


Question: Is the following schedule for these two transaction serially
equivalent?



Ri(x) Ri(y) Rj(y) Wj(y) Ri(x) Wj(x) Wi(y)



2.14.5.1 Transaction Nesting




Transactions may themselves be composed of multiple transactions


eg Transfer is a composition of withdraw and
deposit transactions, which are themselves composed of read and
write transactions


Benefits:

  • Nested transactions can run concurrently with other transactions
    at same level in hierarchy
  • If lower levels abort, may not need to abort whole transaction.
    Can instead use other means of recovery.


2.14.6 Summary





  • Transactions provide technique for managing stateful servers
  • Need to worry about concurrency control
  • Need to worry about aspects of distribution
  • Need to worry about recovery from failure


2.15 Concurrency Control and Transactions





  • Problem restatement
  • Locking
  • Optimistic control
  • Timestamping


2.15.1 Why concurrency control?





  • To increase performance, multiple transactions must be able to
    carry on work simultaneously...
  • ...but if data is shared, then can lead to problems such as lost
    updates and inconsistent retrievals.
  • So we must ensure schedules of access to data for concurrent
    transactions are computationally equivalent to a serial schedule of
    the transactions.


2.15.2 Locking





  • As in operating systems, locks control access for different
    clients
  • Granularity of data locked should be small so as to maximise
    concurrency, with trade-off against complexity.
  • To prevent intermediate leakage, once lock is obtained, it must
    be held till transaction commits or aborts


2.15.2.1 Conflict rules





  • Conflict rules determine rules of lock usage
  • If operations are not in conflict, then locks can be shared
    $ \Rightarrow$ read locks are shared
  • Operations in conflict imply operations should wait on lock
    $ \Rightarrow$ write waits on read or write lock, read waits on write
    lock
  • Since can't predict other item usage till end of transactions,
    locks must be held till transaction commits or aborts.
  • If a transaction needs to do another operation on the same data, it
    promotes its lock if necessary and possible - the new operation
    may conflict with an existing shared lock


2.15.2.2 Rules for strict two phase locking





  1. When operation accesses data item within transaction

    1. If item isn't locked, then server locks and proceeds
    2. If item is held in a conflicting lock by another transaction,
      transaction must wait till lock released
    3. If item is held by non-conflicting lock, lock is shared and
      operation proceeds
    4. If item is already locked by same transaction, lock is
      promoted if possible (refer to rule b)


  2. When transaction commits or aborts, locks are released
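The rules above can be sketched as a toy lock manager. All names are invented, and waiting is not modelled: a conflicting request simply returns 'wait' instead of blocking the caller.

```python
class LockManager:
    def __init__(self):
        self.locks = {}   # item -> (kind, set of transIds holding it)

    def lock(self, trans, item, kind):   # kind is 'read' or 'write'
        if item not in self.locks:
            self.locks[item] = (kind, {trans})           # rule 1.1
            return 'granted'
        held_kind, holders = self.locks[item]
        if holders == {trans}:
            if kind == 'write':
                self.locks[item] = ('write', holders)    # rule 1.4: promote
            return 'granted'
        if held_kind == 'read' and kind == 'read':
            holders.add(trans)                           # rule 1.3: share
            return 'granted'
        return 'wait'                                    # rule 1.2: conflict

    def release(self, trans):
        # Rule 2: all locks released only at commit or abort.
        for item in list(self.locks):
            kind, holders = self.locks[item]
            holders.discard(trans)
            if not holders:
                del self.locks[item]
```

Holding every lock until commit or abort is what makes the scheme strict: no other transaction can see intermediate values.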


2.15.2.3 Locking Implementation





  • Locks generally implemented by a lock manager
    lock(transId, DataItem, LockType): lock the specified item if
    possible, else wait according to the rules above

    unLock(transId): release all locks held by the transaction


  • Lock manager generally multi-threaded, requiring internal
    synchronisation
  • Heavyweight implementation


2.15.2.4 Example


Transactions T and U.

  • T:
    RT(i), WT(j, 44)
  • U:
    WU(i, 55), RU(j), WU(j, 66)

Question What are the possible schedules allowed under strict
locking?


Question Are there any schedules computationally equivalent
to a serial schedule which are disallowed?



2.15.2.5 Deadlocks





  • Locks imply the possibility of deadlock, under the following conditions:

    1. Limited access (eg mutex or finite buffer)
    2. No preemption (if someone has a resource, it can't be taken away)
    3. Hold and wait: threads hold some of the resources they need
      while waiting for the remainder to become free.
    4. Circular chain of requests and ownership.

  • Most common way of protecting against deadlock is through
    timeouts. After timeout, lock becomes vulnerable and can be
    broken if another transaction attempts to gain lock, leading to
    aborted transactions


2.15.2.6 Drawbacks of Locking





  • Locking is overly restrictive on the degree of concurrency
  • Deadlocks produce unnecessary aborts
  • Lock maintenance is an overhead, that may not be required


2.15.3 Optimistic Concurrency Control



  • Most transactions do not conflict with each other
  • So proceed without locks, and check on close of transaction that
    there were no conflicts

    • Analyse conflicts in validation process
    • If conflicts could result in non-serialisable schedule, abort
      one or more transactions
    • else commit




2.15.3.1 Implementation of Optimistic Concurrency Control


Transaction has following phases

  1. Read phase in which clients read values and acquire tentative
    versions of items they wish to update
  2. Validation phase in which operations are checked to see if they
    are in conflict with other transactions - complex part. If invalid,
    then abort.
  3. Write phase: if validated, tentative versions are made permanent
    and the transaction can commit; otherwise it aborts.


2.15.3.2 Validation approaches





  • Validation based upon conflict rules for serialisability
  • Validation can be either against completed transactions or
    active transactions - backward and forward validation.
  • Simplify by ensuring only one transaction in validation and
    write phase at one time
  • Trade-off between the number of comparisons made and the number of
    transactions whose read/write sets must be stored.


2.15.3.2.1 Forward Validation



  1. A transaction in validation is compared against all
    transactions that haven't yet committed
  2. Writes may affect ongoing reads
  3. The write set of the validating transaction is compared
    against the read sets of all other active transactions.
  4. If the sets conflict, then either abort validating
    transaction, delay validation till conflicting transaction
    completes, or abort conflicting transaction.


2.15.3.2.2 Backward validation



  1. Writes of the current transaction can't affect previous
    transactions' reads, so we need only compare the current
    transaction's reads against the writes of overlapping transactions
    that have already committed.
  2. If current read sets conflict with already validated
    overlapping transactions write sets, then abort validating
    transaction

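Both validation directions reduce to set-intersection checks. A sketch,
assuming read and write sets are represented as Python sets (an
illustrative encoding, not from the notes):

```python
def forward_validate(write_set, active_read_sets):
    """Forward validation: compare the validating transaction's write set
    against the read sets of all still-active transactions.
    Returns the ids of conflicting transactions."""
    return [tid for tid, reads in active_read_sets.items()
            if write_set & reads]

def backward_validate(read_set, committed_write_sets):
    """Backward validation: compare the validating transaction's read set
    against the write sets of overlapping committed transactions."""
    return [tid for tid, writes in committed_write_sets.items()
            if read_set & writes]
```

With forward validation a non-empty result leaves a choice of victim;
with backward validation the only option is to abort the validating
transaction itself.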
2.15.4 Timestamping


Operates on tentative versions of data

  • Each transaction receives a globally unique timestamp on initiation
  • Every object x in the system or database carries the maximum (ie
    youngest) timestamp of the last transaction to read it, RTM(x), and
    of the last transaction to write it, WTM(x)
  • If transaction requests operation that conflicts with younger
    transaction, older transaction restarted with new timestamp.
  • Transactions committed in order of timestamps, so a transaction
    may have to wait for earlier transaction to commit or abort before
    committing.
  • Since tentative version is only written when transaction is
    committed, read operations may have to wait until the last
    transaction to write has committed.



An operation in transaction Ti with start time TSi
is valid if:

  • The operation is a read operation and the object was last
    written by an older transaction ie
    TSi > WTM(x). If read permissible,
    RTM(x) = MAX(TSi, RTM(x))
  • The operation is a write operation and the object was
    last read and written by older transactions ie

    TSi > RTM(x) and
    TSi > WTM(x). If permissible,
    WTM(x) = TSi

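The two validity rules above can be sketched as follows, keeping RTM
and WTM in dictionaries keyed by object (names and encoding are
illustrative):

```python
class TimestampOrdering:
    """Sketch of the basic timestamp-ordering checks. Each method returns
    True if the operation is valid, False if the requesting transaction
    must be restarted with a new timestamp."""
    def __init__(self):
        self.rtm = {}   # RTM(x): youngest timestamp to have read x
        self.wtm = {}   # WTM(x): youngest timestamp to have written x

    def read(self, ts, x):
        if ts <= self.wtm.get(x, float("-inf")):
            return False          # x written by a younger transaction
        self.rtm[x] = max(ts, self.rtm.get(x, float("-inf")))
        return True

    def write(self, ts, x):
        if (ts <= self.rtm.get(x, float("-inf"))
                or ts <= self.wtm.get(x, float("-inf"))):
            return False          # x read or written by a younger transaction
        self.wtm[x] = ts
        return True
```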

2.15.5 Summary



  • Locks are commonest ways of providing consistent concurrency
  • Optimistic concurrency control and timestamping used in some
    systems
  • But, consistency in concurrency is application dependent - for
    shared editors, people may prefer to trade speed of execution
    against the possibility of having to resolve conflicts. Problems can
    occur with long-term network partition. Approaches based on
    notification and human resolution are becoming popular.



2.16 Distributed Transactions





  • Models for distributed transactions
  • Attaining distributed commitment
  • Distributed Concurrency Control


2.16.1 Single Server Transactions




\includegraphics{figs/single-server-trans}



  • Till now, transactions have referred to multiple clients, single
    server.
  • How do we have multiple clients interacting with multiple
    servers? eg complicated funds transfer involving different accounts
    from different banks?
  • Generalise transactions to distributed case...

2.16.2 Distributed Transactions


2.16.2.1 Distributed Transaction Requirements


General characteristics of distributed systems

  • Independent Failure Modes
  • No global time
  • Inconsistent State



Need to consider:

  • how to achieve distributed commitment (or abort)
  • how to achieve distributed concurrency control


2.16.2.2 Models


\includegraphics{figs/transaction-models}



  • If the client runs a sequence of transactions, each transaction must
    complete before proceeding to the next
  • If transactions are nested, then transactions at same level can
    run in parallel
  • Client uses a single server to act as coordinator for all
    other transactions. The coordinator handles all communication with
    other servers



Question: What are the requirements of transaction ids?


2.16.3 Atomic Commit Protocols





  • Distribution implies independent failure modes, ie a machine can
    fail at any time, and the others may not discover the failure.
  • With one phase commit, the client requests commit, but one of
    the servers may have failed - no way of ensuring durability
  • Instead, commit in 2 phases, thus allowing server to request abort.


2.16.3.1 2 Phase Commit



  • One coordinator responsible for initiating protocol.



  • Other entities in protocol called participants.
  • If coordinator or participant unable to commit, all parts of
    transaction are aborted.



  • Two phases

    Phase 1: Reach a common decision

    Phase 2: Implement that decision at all sites


2.16.3.1.1 2 Phase Commit Details





  1. Phase 1 The coordinator sends a Can Commit? message to all
    participants in transaction.
  2. Participants reply with a vote of yes or no. If the vote
    is no, the participant aborts immediately.
  3. Phase 2 Coordinator collects votes including own:

    1. If all votes are yes, coordinator commits transaction
      and sends DoCommit to all participants.
    2. Otherwise transaction is aborted, and coordinator sends abortTransaction to all participants.


  4. When a participant receives DoCommit, it commits its
    part of the transaction and confirms using HaveCommitted

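The decision rule of the protocol reduces to a conjunction of the
votes. A minimal sketch of phase 2's decision only; messaging, timeouts
and crash recovery are deliberately omitted:

```python
def two_phase_commit(coordinator_vote, participant_votes):
    """Coordinator decision logic: collect all votes, including the
    coordinator's own, and commit iff every vote is yes."""
    votes = [coordinator_vote] + list(participant_votes)
    if all(v == "yes" for v in votes):
        return "DoCommit"
    return "abortTransaction"
```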
2.16.3.1.2 2 Phase Commit Diagram




\includegraphics{figs/2phasecommit}


Note:

  • If participant crashes after having voted to commit, it can ask
    coordinator about results of vote.
  • Timeouts are used when messages are expected.
  • Introduces new state in transaction Prepared to commit.

2.16.4 Distributed Concurrency Control


2.16.4.1 Locking



  • Locking is done per item, not per client.
  • No problems generalising to multiple servers...
  • ...except in dealing with distributed deadlock
  • Same techniques as usual, but interesting dealing with
    distributed deadlock detection.


2.16.4.2 Optimistic Concurrency Control



  • Need to worry about distributed validation
  • Simple model of validation had only one transaction being
    validated at a time - this can lead to deadlock if different
    coordinating servers attempt to validate different transactions.
  • Also need to validate in correct serialisable order.
  • One solution is to globally allow only one transaction to
    validate at a time.
  • Another solution is to validate in two phases with timestamp
    allocation - local, then global to enforce ordering.


2.16.4.3 Timestamping



  • If clocks are approximately synchronised, then timestamps can be

    <local timestamp, coordinating server id>

    pairs, with an ordering defined upon server ids.

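Such pairs order naturally under lexicographic comparison: compare the
local timestamps first, and break ties with the server-id ordering. A
sketch, assuming server ids are comparable strings:

```python
def ts_less(a, b):
    """Compare two <local timestamp, server id> pairs. Python's tuple
    comparison is lexicographic, which is exactly the ordering needed:
    timestamp first, server id as tie-breaker."""
    return tuple(a) < tuple(b)
```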

2.16.5 Summary





  • Nested transactions are the best model for distributed transactions
  • The Two Phase Commit protocol is suitable for almost all cases
  • Distributed concurrency control is only slightly more difficult
    than the single server case

2.17 Transactions: Coping with Failure



  • Failure Modes
  • Recovery Techniques
  • Partitions and quorum voting


2.17.1 Failure Modes


For Transactions to be atomic and durable, need to examine
failures

  1. Transaction-local failures, detected by the application which
    calls abort eg insufficient funds. No info loss, need
    to undo changes made.
  2. Transaction-local failures, not detected by the application but by
    the system as a whole, eg divide by zero. System calls abort.
  3. System failures affecting transactions in progress but not media
    eg CPU failure. Loss of volatile store and possibly all
    transactions in progress. On recovery, special recovery manager
    undoes effects of all transactions in progress at failure.
  4. Media failures affecting database eg head crash. No way of
    protecting against this.





2.17.2 Recovery



  • We assume a machine crashes, but then is fixed and returns to operation.
  • We need to recover state to ensure that the guarantees of the
    transactional systems are kept.
  • Use a recovery file or log that is kept on permanent storage.


2.17.2.1 The Recovery Manager





  • Recovery from failure handled by entity called Recovery Manager.
  • Keeps information about changes to the resource in a recovery file (also called Log)
    kept in stable storage - ie something that will survive failure.
  • When coming back up after failure, the recovery manager looks
    through the recovery file and undoes (or redoes) changes so that
    uncommitted transactions appear never to have happened, and
    committed transactions' effects persist.
  • Events recorded on Recovery file for each change to
    an object in database.


2.17.2.2 Recovery File


Information recorded per event includes:

  Transaction Id: to associate the change with a transaction
  Record Id: the identifier of the object
  Action type: create/delete/update etc
  Old Value: to enable changes to be undone
  New Value: to enable changes to be redone

Also log beginTransaction, prepareToCommit, commit, and abort actions,
with their associated transaction id.
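One possible encoding of a single recovery-file event carrying the
fields listed above (field names are illustrative, not from the notes):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class LogRecord:
    """One recovery-file event. Control actions such as beginTransaction,
    prepareToCommit, commit and abort would set action accordingly and
    leave the value fields unused."""
    trans_id: str     # associates the change with a transaction
    record_id: str    # identifier of the object changed
    action: str       # "create" / "delete" / "update" / control actions
    old_value: Any    # for undo
    new_value: Any    # for redo
```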


2.17.2.3 Recovering


If after failure,

  • the database is undamaged, undo all changes made by
    transactions executing at time of failure
  • the database is damaged, then restore database from archive and
    redo all changes from committed transactions since archive date.

The recovery file entry is made and committed to stable storage before
the change itself is made (write-ahead logging) - so incomplete
transactions can be undone, and committed transactions redone.


What might happen if database changed before recovery file written?


Note that recovery files have information needed to undo transactions.



2.17.2.4 Checkpointing




Calculation of which transaction to undo and redo on large logs can be
slow.


Recovery files can get too large


Instead, augment recovery file with checkpoint

  • Force recovery file to stable storage
  • Write checkpoint record to stable store with

    1. A list of currently active transactions
    2. for each transaction a pointer to the first record in
      recovery file for that transaction


  • Force database to disk
  • Write address of checkpoint record to restart location atomically


2.17.2.5 Recovering with checkpoints




To recover, have undo and redo lists. Add all active transactions at
last checkpoint to undo list

  1. Forwards from checkpoint to end,

    1. If find beginTransaction add to undo list
    2. If find commit record add to redo list
    3. If find abort record remove from undo list


  2. Backwards from end to the first record of the checkpointed
    transactions, execute undo for all transaction operations on the
    undo list
  3. Forwards from checkpoint to end, redo operations for transactions
    on the redo list



At each checkpoint, the recovery file can be truncated up to the first
logged record of the checkpointed transactions


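The first, forward pass above might be sketched as follows. The log
encoding is assumed for illustration, and removing a committed
transaction from the undo list is a reasonable reading that the notes
leave implicit:

```python
def recovery_lists(checkpoint_active, log_after_checkpoint):
    """Pass 1 of checkpoint recovery: compute the undo and redo lists.
    checkpoint_active is the list of transactions active at the last
    checkpoint; log entries are (action, trans_id) pairs."""
    undo = set(checkpoint_active)      # assume active transactions need undoing
    redo = set()
    for action, tid in log_after_checkpoint:
        if action == "beginTransaction":
            undo.add(tid)
        elif action == "commit":
            redo.add(tid)
            undo.discard(tid)          # committed work is redone, not undone
        elif action == "abort":
            undo.discard(tid)
    return undo, redo
```

The actual undo and redo passes (steps 2 and 3) would then walk the log
backwards and forwards applying the old and new values respectively.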

2.17.2.6 Recovery of the Two Phase Commit Protocol




\includegraphics{figs/2phasecommit}

  • Coordinator uses prepared to signal starting protocol,
    commit on signalling DoCommit and done to indicate end
    of protocol in recovery file.
  • Participant uses uncertain to indicate that it has
    replied yes to the commit request, and committed when it
    receives DoCommit.
  • On recovery, coordinator aborts transactions which reached only
    prepared, and resends DoCommit when in commit state
    but not done
  • Participant requests decision from coordinator if in
    uncertain state, but not committed.


2.17.3 Network Partition


Transactions are often used to keep replicas consistent.


If network partitions (cable breaks), replicas divided into two or more sets
(possibly with common members).


\includegraphics{figs/partition}


Can we still write and read from any of the sets?


Yes, but

  • Must reduce possible read and write sets to maintain consistency
  • Or relax consistency requirements and resolve problems when
    partition is healed


2.17.3.1 Quorum Consensus


Consider set of replicas, where replicated objects have version
numbers at each replica.

  1. Assign a weighting of votes to each replica, indicating importance.
  2. For a client to perform an operation, it must gather votes from all
    the replicas it can talk to (denote the total X).
  3. X >= R, the votes set for the read quorum, to enable a read
  4. X >= W, the votes set for the write quorum, to enable a write.
  5. As long as

    • W > half the total number of votes
    • R + W > total number of votes in the group

    each read quorum and each write quorum will have at least one member
    in common.


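The two constraints can be checked directly. A sketch, where votes is
the per-replica weighting (function name is illustrative):

```python
def valid_quorums(votes, R, W):
    """Check the quorum consensus constraints: W must exceed half the
    total votes, and R + W must exceed the total, which forces every
    read quorum to overlap every write quorum."""
    total = sum(votes)
    return W > total / 2 and R + W > total
```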
2.17.3.2 Partition Example


For three replicas, R1, R2, R3, we can allocate votes to give
different properties depending upon requirements


  Replica    config 1    config 2    config 3
  R1             1           2           1
  R2             0           1           1
  R3             0           1           1


  1. What should R and W be set to in the three
    configurations?
  2. What properties result from these configurations?


2.17.3.3 Read and Write Operations




When write happens, object has a version number incremented


To read, collect votes from replica managers along with the version
numbers of the object. A read quorum is guaranteed to contain at least
one up-to-date copy, from which the read occurs.


To write, collect votes from replica managers along with the version
numbers of the object. If a write quorum of up-to-date copies is not
found, propagate the up-to-date copy to enough replicas to create one.
Then the write is allowed.

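The read side might be sketched as follows. Assuming the quorum
constraints above hold, the R gathered copies are guaranteed to include
at least one with the highest version number (the encoding and names
are illustrative):

```python
def quorum_read(replicas, R):
    """Read under quorum consensus: gather (version, value) pairs from
    any R reachable replicas and read from the copy with the highest
    version number. replicas maps replica id -> (version, value)."""
    reachable = list(replicas.items())
    if len(reachable) < R:
        raise RuntimeError("cannot assemble a read quorum")
    quorum = dict(reachable[:R])               # any R replicas will do
    best = max(quorum, key=lambda r: quorum[r][0])
    return quorum[best]
```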

Manipulating R and W gives different characteristics, eg R = 1 and W = number of copies gives unanimous update.


Cached copies of objects can be incorporated as weak
representatives
with 0 votes, but usable for reads.



2.17.4 Summary



  • Atomicity comes from using logging techniques on operations at
    server, where log is kept on stable storage
  • Voting can be used to give availability for resources on
    partitioned replicas.


3. Exercises and answers


