2009年5月25日星期一

电影评论

电影评论
七龙珠 ---垃圾，既无场面有无情节
变形金刚 ---不错，但是如果不看高清，基本就没有多少意义。
流浪狗之家.Hotel.for.Dogs.2009 ---中，
马达加斯加2：逃往非洲.Madagascar.Escape.2.Africa.2008 -----5分，搞笑的智慧一流

2009年5月12日星期二

联通GPRS包月重复收费问题

4月29号，给10010拨打电话，要求将10元包80M的GPRS流量包在下个月变更为5元套餐。当时10010客服回复会在48小时内变更成功。5月8号，在www.10010.com查询
4月份的月固定费用为37元。而此前的2，3月份都是32元。
05月12日，拨打10010电话，第一次，客服回答是GPRS包月产生15元费用，并回复其重复收费是应该的、合理的；第二次拨打10010，客服5991，答应过后回复电话。
我既然购买了10元包80M的GPRS流量包，且按照中国联通的说法，在4月初已经将10元扣除；在联通扣除10元费用时，我就拥有了在4月份使用80M流量的权利，不应再为GPRS支付任何额外的费用。

联通公司共有以下2条霸王条款：
1，开通GPRS包月，要在0-48小时生效，而取消GPRS包月，则会立即生效，不管你是否已经缴付费用。用户没有权利使用已经支付费用的服务。
2，用户变更GPRS包月，则会在变更时重复收取费用。用户需要为已经支付费用的服务再次付费。

要求联通公司返还重复收取的费用，并取消以上2条霸王条款。

2009年4月28日星期二

项目管理及C++基本编码守则-jhonglei

每天总结自己编写代码中所犯的错误，并反思出错的原因；总结所获取的经验；归纳编程的基本原则。

文档的编写：

以后所有提交的文档的文件名中写三部分：文件名-姓名-日期。文件名要简洁，语义清晰。这样文档多了容易查找

程序开发

　变量：命名规范
　函数的编写　
注释：
　错误处理　函数参数检查
　内存管理
　版本控制
　日志系统
　
阅读以下书籍：
《C++ Primer》
　　《Effective Ｃ＋＋》
以下几点是在程序开发过程中实际发现的应当遵守的地方。

错误处理：

有限状态自动机是保证编程正确性,安全性的有力武器。比如在网络编程时，状态记录有助于排错。

使满足状态的包才会被处理,不满足状态变迁的包将被丢弃。如只有登陆后才进行继续的处理。防止恶意包和欺骗行为。

任何不满足格式要求的包将被抛弃。

一定要对收到的数据包的每一个字段进行长度检查，不满足长度要求得数据包进行抛弃。
对数据包的源地址进行检查，不满足长度要求进行抛弃。

例如：
eyeBeam，语音和视频从不同的IP发送是不可以接受的。nettalk中途更换语音的发送IP也是不可以接受的。其语音包的负载大小：也必须是20个字节：20ms的数据。
对于大于20的语音包，直接抛弃。如我们的nettalk.当视频RTP包发送到语音RTP包的接收端口时，由于没有做长度检查，导致内存越界，使得接收SOcket被改写。。

指针越界问题：使用必要的边界检查。预防缓冲区溢出漏洞

如果可能，进行必要的协议分析，以保证程序的正确性。

2，函数必须有两种返回值，1），程序是否正确执行，2），程序执行的结果

错误检查是保证程序健壮性的必要因素。

3，能用if条件判断的，不使用异常。

4，对于两种不同的协议，使用不同的端口进行处理。RTPRelay的命令包和RTP数据流应当分开。

5, 遵守最基本的命名规范，成员变量全部为private, 以m_*开头。。变量名应当容易理解，见名知意。

不要使用什么　dd ,zhu等看不出含义的名字。

变量的作用域越大，变量的命名应当越谨慎。

如有可能，不要使用全局变量。

禁止使用如下方式：

　　　#define PORT 18800

原因：很有可能出现宏重定义，且名字也不清晰。

6，动态内存的管理：尽量少用。

使用内存检查工具检查内存内存泄露，如valgrind，但不要完全把希望寄托到检查工具上。

编写内存安全的代码是程序开发人员的责任。Win32的RuntimeChecker

使用相匹配的new/delete

    char *pbuf = new char[len+1];

    delete [] pbuf;

使用valgrind:

[hongleij@hustp2psrv bin]$ valgrind -v --leak-check=full ./meridianClient

==11443== 1 errors in context 7 of 9:

==11443== Conditional jump or move depends on uninitialised value(s)

==11443==    at 0x418AE16F: pthread_mutex_init (in /lib/libpthread-2.5.so)

==11443==    by 0x418128DC: pthread_mutex_init (in /lib/libc-2.5.so)

==11443==    by 0x4010877: CMeridianThread::CMeridianThread() (CriticalSection.h:55)

==11443==    by 0x4015A80: StartRTPRelay (RTPMain.cpp:17)

==11443==    by 0x804868B: main (Client.cpp:7)

使用必要的初始化函数：

    CriticalSection()

{

   pthread_mutexattr_t mutex_attribute;



   pthread_mutexattr_init(&mutex_attribute);//add jianghonglei 081010



   pthread_mutexattr_settype(&mutex_attribute, PTHREAD_MUTEX_RECURSIVE);



   pthread_mutex_init(&mutex_, &mutex_attribute);



   pthread_mutexattr_destroy(&mutex_attribute);//add jianghonglei 081010

}

7，使用CVS或SVN服务器来进行版本控制，并对文件目录的设置进行规范。
　　　维护以下部分：
　　　　　源代码，文档，参考资料
              尤其是跨平台程序的开发。win工程文件使用单独的目录，代码与配置清晰
              设置的目的不仅仅维护代码，而且包括文档，
              每个用户有不同的权限，每次提交必须添加注释，表明这次修改的目的，和测试结果

8，编写可复用的代码。相当于对代码进行了多次测试。品质更容易保障。

9，统一的日志输出格式：正确区分编程错误，网络IO错误，和异常情况。可以支持模块化调试。

     如只输出某个或某几个模块的调试信息。

1），   调试信息以模块名开头如：KAD

2），并区分错误信息的等级，类型：

3），输出时间

10：绝不允许出现类似如下的情况：

   char * pBuf =NULL;

   std::cout<<pBuf;

11：不允许出现函数需要返回值却没有返回值的情况，例如

   int func()

{

　　 if(true)

{

　　　 return 1;

　　 }

//this place should add return !!!

}

最好保证函数有单一的入口和出口。

12: 调用网络相关函数，必须对错误情况/返回值进行检查。不允许不做检查的情况出现。！！！

    sendto, recvfrom

    ReceiverAddr.sin_family = AF_INET;//IMOPTANT!!!!

                ReceiverAddr.sin_port=htons(18888);

                ReceiverAddr.sin_addr.s_addr=inet_addr(iter->ip);

13：为防止端口冲突，程序中所有端口在同一文件中配置。

14: 每行代码不超过80个字。目的：在比较时容易忽略掉。

15: C++异常的处理，如果使用，应当尽可能在最内层处理。

16：数据一致性是程序正确性最根本的保证。应当充分尊重数据库的黄金法则：“同一个意思只表达一次”。至少包括两层意思：代码简练，少冗余。

17.不得做如下调用，在同一函数中调用2次inet_ntoa函数；

    RTPAdd(localIP,AVIO_PORT,

        inet_ntoa(pMgr->m_CalleeAddr.sin_addr),ntohs(pMgr->m_CalleeAddr.sin_port),

        inet_ntoa(pMgr->m_SuperNode.sin_addr),MERIDIAN_COMMAND_PORT

        ,NULL,NULL,GetTickCount());

18: 调用sendto()时，必须是系统支持的库函数
void CAVIOMgr::OnRelayCommand(const char * ppc,int recvPackLen,SOCKADDR *recvAddr)
{
            SOCKADDR_IN pingAddr ;//TODO 为什么必须要进行一次转换？？
            memcpy(&pingAddr,recvAddr,sizeof(SOCKADDR_IN) );
            int bytesSended = m_Socket.SendTo((const char *)tmpBuf.GetBufferPrt(),tmpBuf.GetWritePos(),(SOCKADDR*)&pingAddr);
        //否则会产生10047 WSAEAFNOSUPPORT 错误

}
19,select的使用:
                if( FD_ISSET( pMgr->m_Socket.GetSocket(), &writefds ) )
                {
                    pMgr->processPackSend();
                }
                if( FD_ISSET( pMgr->m_Socket.GetSocket(), &readfds ) ) //不是 else if
                {
                }
20：创建项目的Bug错误列表。
        表明程序的修订状况。
21：出现Stack错误。
      在调试状态没有问题，不调试出现问题。
     原因：某些变量在使用时没有初始化。。
22:　保证程序的健壮
　　　程序的

================================

l99.com -> 相机照片，时间，GPS数据整合 -〉应用

未来的信息搜索 --〉依时间轴依次排列，关联度（where ,what,time），

拼车网 --〉撮合率凡事预则立（临时需求）

火车票

游戏组队

             信息的过滤和筛选--〉有用信息，BBS，

51758855--719

但>2M*7K=

实际用户数：license

    >30-40 CPS

    2get 1put 30*2 =60读

用户数据量：IMS标准 3GPP 30K

CSF

话单

DNS 解析:

nokia sip stack

ims mar maa

2009年4月22日星期三

Dell D620 H264 1080P 硬解码与播放器的选择

暴风影音2009不支持Dell D620 NVS110M 的H264硬解码，而PowerDVD支持！

Quadro NVS 110M/120M都采用G72M的显示核心，110M的性能相当于Geforce GO 7300，120M则相当于Geforce GO 7400。
Quadro NVS 140M采用G86M核心，性能应该相当于Geforce     8600GS或者8400GT。 Quadro NVS系列显示芯片的优势并不在强大的3D性能，如Quadro NVS 110M在3DMark 03下得分只有3308。根据nVIDIA的官方资料，Quadro NVS笔记本型电脑解决方案提供符合现今专业人士要求的可靠度、稳定度、易用性，并通过各种商业应用软件的兼容性测试。

硬件：Dell D620
CPU： T2300E
内存：2*512M
显卡：Quadro NVS 110M
OS ：WinXP SP3
驱动： ForceWare 174.31 DEll网站的R181148.exe
电影：Transformers.Blu-ray.REMUX.H264.1080P.TrueHD.DD51.SILUHD.disk1.ts
根据本人的测试结果：
暴风影音2009 "1对1"特别版[3.09.03.25版]
     不支持H.264硬件解码，播放1080P电影时CPU占用率100%.其所谓的“全面支持市面所有已知的74种高清加速显卡”纯属虚假宣传，误导消费者。
PowerDVD.Ultra.Deluxe.v7.3
      支持H.264硬件解码，播放1080P电影时CPU占用率50-60%

分布式存储的研究现状

由于单一的服务器无法存储数百TB的数据，无法每秒处理百万级的并发请求，分布式存储系统应运而生。例如截至08年09月，eBay存储了超过2PB的数据,每天需要处理25PB数据，执行480亿次SQL查询。另外，出于规模经济的原因，大的数据中心比小的数据中心有显著的成本优势：在一个5万台服务器的数据中心和1千台服务器的数据中心相比较，平均网络带宽和管理成本前者是后者的1/7.1，而平均存储成本则只有后者的1/5.7。存储系统中数以万计的服务器，使得任何组件出现短暂或永久性故障都成为常态，网络的时延、带宽、成本约束，应用对高性能、高可用、高延展性、易于管理的要求，都对分布式存储系统的设计提出了严峻的挑战。

问题的来源

为了应对各种失效情况，提高可用性和性能，降低通讯成本，通常采用数据的冗余策略，而副本（采用多台失效相互独立的服务器）和纠删码是两种常用的方式。这样即使其中一些复本失效或出现网络分隔(Network Partition)的情况，用户也可以访问到一些服务器，从而继续操作。通过服务器复本以及协调客户端与服务器复本间的交互,复本状态机（Replication State
machine）成为实现容错服务的通用方法。Chubby[OSDI'06,PODC'07], ZooKeeper[Apache]和Boxwood
[OSDI’04]等服务都使用了复本状态机。这种方式也提供了用来理解和设计复本管理协议的架构。许多协议都包含了数据或软件的复本,用于屏蔽错误或者在没有中心控制的情况下协作。确定性(Deterministic)是一种理想可以提供容错的性质。直观上，如果一个系统有多个复本存在，错误很容易因不同复本的状态或输出不同而被发现。通常需要2F+1的节点来纠正F个错误节点的情况。特别的，如果失效复本不产生输出，则只需要F+1个复本。拜占庭错误是指错误节点会向不同节点产生不同输出，错误节点可以产生随机、伪造的错误或者恶意、有智能的攻击。这种情况至少需要3F+1才能纠正F个错误节点产生的错误信息，也即著名的Byzantine-Fault
Tolerance (BFT)问题。对于一个确定的服务，服务器状态和要执行的功能都复制到一个服务器集合中每台服务器复本，复本间使用协商协议（consensus protocols）就要执行的命令达成一致。常用的协商协议有Paxos、Fast Paxos[Distributed Computing'06]和CoReFP，Mencius[OSDI'08]是一种来源于Paxos的用于广域网的多领导状态机复本协议，以达到在高负载时达到高吞吐率、低负载时低时延的目的，自适应变化的网络和负载状况。近年，如OSDI’06、OSDI’08、NSDI‘09都有关于复本状态机相关的论文。特别对于BFT问题，SOSP'07、NSDI’08、NSDI'09都设有专题讨论，ICDCS'08和USENIX’09也有相关论文。MIT、CMU、RICE、德克萨斯大学、Intel、Microsoft都有学者从事相关研究，成为近年学术前沿的研究热点问题。例如，Zeno[NSDI'09]通过弱化一致性，来提高BFT的性能和可用性。德克萨斯大学奥斯汀分校的Allen Clement等人在NDSI'09的论文中发现现有BFT协议无法有效应对客户端的恶意请求，由此提出容忍拜占庭错误的BFT系统。

根据CAP理论（Consistency, Availability, and Partition-tolerance），数据一致性，数据的可用性和网络分隔容忍三者最多只能满足其中的两个。数据的一致性是指用户总是得到正确的数据；数据的可用性是指用户总是能够得到相应；网络分割容忍是指允许网络丢失节点间的任意消息。当网络分裂时，从一个网络分区节点发往另一网络分区中节点的消息都会丢失。对于只读存储系统，数据一致性很容易得到保障；而对于可以执行更新操作的存储系统，为保证业务逻辑的正确性，必须设计恰当的数据一致性模型，使一致性得到某种程度的保证。传统的关系性数据库通过提供我们熟知的串行化事务来提供并发操作的数据一致性。但是对于分布式系统，提供串行化事务在性能和可用性上是不可行的。而根据Yahoo!的许多Web应用的经验，这些应用倾向于每次操作一条记录，通用的事务机制通常并不需要；例如，用户更改个人形象，上传一张图片，邀请若干好友发起会议，即使新的个人形象对于一个好友没有马上可见，也是没有多少影响的。事实上许多分布式复本系统只提供了最终一致性（eventual consistency）：客户端每次更新时，只需更新该对象的任一复本；系统只保证该对象的所有更新最终会应用到每个复本，而对于不同复本，更新顺序可能不同。但是，这种最终一致性通常太弱，而不适用于例如Web等应用。例如Amazon的Dynamo利用Vector clocks 在读取时重新排列更新以解决某些更新冲突，而Yahoo!的PNUTS，提供每记录的时间轴一致性模型来提供不同等级的一致性。而WheelFS允许应用程序通过语义线索（semantic cues）控制数据一致性。

分布式存储系统的分类

按照系统功能可以划分为：归档/备份、通用文件系统、发布/共享、高性能等等。按照存储架构：Client/Server、P2P;其中P2P架构又可划分为全局中心架构、本地中心架构和纯P2P架构。也可以按照操作环境，使用模式，对于数据一致性的要求，安全性，自动管理，路由策略和网络环境等进行分类。总的来说，根据应用场景，可以分为由广域网、非信任节点组成的存储系统和面向数据中心的存储系统。不同存储系统提供的接口也不同，有分层（文件系统）命名空间和扁平（P2P）命名空间；有的提供复杂的结构化查询，有的只提供简单的key-value查询。另外存储系统的节点规模、类型（PC或Server）、地理分布及网络连接方式也可以做为分类依据；有的提供开放的协议，而有的完全私有。另外，不同存储系统的业务需求，I/O模式也千差万别：有的是只读系统，有的则要求支持更新操作；不同应用对I/O时延的要求也不相同；例如对于电子商务，为了良好的用户体验，必须提供“总是可写”的方式，并且用户的请求必须在数百毫秒内的得到相应。下面就近年出现的工业界和学术界的存储系统大致归类：

广域网、非信任节点

Oceanstore, CFS, PAST, Ivy[OSDI’02],Glacier[NSDI'05],TFS[FAST'07],WheelFS
[NSDI'09,MIT]，CA-NFS [FAST’09]

数据中心

分布式虚拟磁盘: Petal [lee’96]

分布式文件系统 : CEPH [OSDI'06], Farsite [Microsoft]

集群文件系统 : Sorrento [Supercomputing'04], Panasas [FAST'08], GoogleFS
[SOSP'03]

集群存储系统 : Ursa Minor [FAST '05], RADOS [PDSW '07], and FAB [SIGOPS'04].
Dynamo (SOSP’07,Amazon)，PNUTS (VLDB'08,Yahoo!)

次级存储系统

归档:Venti
[FAST '02], EMC Centera, Pergamum [FAST'08]

备份:
Pond[FAST'03],DataDomain [FAST'08] ,Cumulus[FAST’09]，SafeStore[USENIX’07],

归档+备份
: HYDRAstor [FAST'09]

其他：

Antiquity (EuroSys‘07, Berkeley),TieStore [FAST‘08] , Carbonite
[NSDI’06],Cimbiosys(NSDI'09)，BitVault[SIGOPS'07]

主要存储系统简介

Coda：支持离线操作，C/S模式

CFS : 只读文件系统, CHORD,文件分块，存储数据块到负责节点及K个后继节点.

PAST：只读文件系统，Pastry，根据文件ID，存储文件到负责节点及K个后继节点

Low-Bandwidth File System (LBFS)：改进ＮＦＳ协议以降低带宽需求，关注一致性而不是在网络分割存在时的可用性

Farsite：创建一个可扩展的无服务器的分布式文件系统来提供中心化的NTFS文件服务器语义，并用POSIX协议严格兼容。Farsite的节点由同一大型组织的桌面计算机组成，基于高速低时延的网络环境，并能处理拜占庭失效问题。

TieStore（FAST’08,Berkeley）面向带宽受限、间断连接，利用DTN（Delay Tolerant Networking）的存储-转发网络组织覆盖网。

WheelFS 广域网文件系统，通过坚持“全局读，本地写”来降低通讯成本，提供语意线索由应用定制对时延、一致性等的需求。

Antiquity（EuroSys '07）使用安全日志结构来保证数据完整性，将每条日志复制到多台服务器来提供数据持久性，使用拜占庭容错协议来保证数据一致性，使用quorum修复协议来替代丢失的复本。在有超过quorum (threshold)数目的复本可用时允许用户执行写操作。

OceanStore 提供全局的、事务的、持久存储服务，并支持广域的复本数据串行化更新。采用冲突消除策略来降低事务被丢弃的数目。OceanStore基于Tapestry后缀路由算法设计，同时支持纠删码和完整复本两种冗余策略，不同于CFS和PAST。文件碎片及其复本能够自由的存储在任意节点上。为此OceanStore在文件ID对应的根节点处维护碎片位置信息，此外根节点负责通过心跳检测碎片所在的节点状态，当出现节点错误的时候，数据的多个根节点联合决定修复数据的策略。通过数据多版本化来支持数据更新，即数据的每次更新都不会覆盖原有数据，而是产生数据的一个新版本，从而绕开复杂的数据一致性维护问题。

GoogleFS

系统由价格低廉的服务器构成，分步式文件系统，由一台主服务器管理所有元数据，数据被切分成Chunk存储在Chunk服务器。设计用于应对大量的读操作，而基本不改写数据。

======================================

节点的成员管理和失效恢复，Gossip

连接、节点的异构性

问题分类：

CAP问题

广域网P2P存储

数据中心p2p存储:Dynamo

安全性

弱一致性存储系统

Weakly Consistent Storage Systems (i.e. Eventual Consistency)

Replication and Fault Tolerance

Scalability and Fault Tolerance

Wide-area file systems

Eventual Consistency：最终一致性

分布的，多复本，以及支持并发访问都是挑战一致性的因素，性能，可用性，架构选择与一致性之间的权衡。

reading notes for <<Above the Clouds: A Berkeley View of Cloud Computing>>

不知道在那里存这些技术笔记，索性先放在这里。将来再找更合适的地方。

今天读了《云之上》，还是有不少启发。看到Berkeley 和David Patterson的名头就进去了。顺便提一下，Randy的那本讲体系结构的书一直没有找到时间看，看到Patterson，又想了起来。

主要的东西原来也陆陆续续看到了一下，所以启发还是来自于细节。

1) 首先看到的一个规模经济的比较，可以看出来大的DC比小的DC还是有显著的优势（7X)，可以预见，如果云计算普及，势必会是相当集中的行业，而现在

大型DC具有不可忽视的先发优势，这恐怕也是大公司拼命忽悠的主要原因。

Table 2: Economies of scale in 2006 for medium-sized datacenter (~=1000 servers) vs. very large datacenter (~=50,000

servers). [24]

|Technology     | Cost in Medium-sized DC     | Cost in Very Large DC        |Ratio |

|Network        | $95 per Mbit/sec/month      | $13 per Mbit/sec/month       |7.1 |

|Storage        | $2.20 per GByte / month     | $0.40 per GByte / month      |5.7 |

|Administration | ~=140 Servers/Administrator | >1000 Servers/Administrator | 7.1 |

2）电力价格差距还是很大的，这样对于中国西部倒是一个机会，不知道电力占成本的比例。粗略的算一下，一台普通台式机，负载100%，一天大致3度电，一度电大约0.45元（上海峰谷平均），这样一年1000元，好像跟我们公司大体一致。这样，大致与一台计算机的一年折旧费用大体相当。这样，电费是运营成本中不可忽视的因素。

Table 3: Price of kilowatt-hours of electricity by region [7].

|Price per KWH | Where | Possible Reasons Why |

|3.6¢          | Idaho | Hydroelectric power; not sent long distance |

|10.0¢         | California | Electricity transmitted long distance over the grid;limited transmission lines in Bay Area; no coal

fired electricity allowed in California. |

|18.0¢         | Hawaii | Must ship fuel to generate electricity |

3）毫无疑问，最为稀缺的资源是带宽。这跟我们平常计算一致。就我这次对我们cluster的性能简单评估的经验来看，瓶颈也是网络速度。可是这也是云计算所不能解决的问题。当然，如果数据也产生于云里，这个问题倒还好办。所以，云计算首先也解决的是云存储，一旦数据进了云里，很多问题就好说了。同时意味着数据的迁移成本也是巨大的。换句话说，还是巨大的先发优势。

Table 5: We update Gray’s costs of computing resources from 2003 to 2008, normalize to what $1 could buy in 2003

vs. 2008, and compare to the cost of paying per use of $1 worth of resources on AWS at 2008 prices.

|              | WAN bandwidth/mo. | CPU hours (all cores) | disk storage |

| Item in 2003 | 1 Mbps WAN link   | 2 GHz CPU, 2 GB DRAM | 200 GB disk, 50 Mb/s

transfer rate |

| Cost in 2003 | $100/mo.          | $2000                 | $200         |

| $1 buys in 2003 | 1 GB | 8 CPU hours | 1 GB |

| Item in 2008 | 100 Mbps WAN link | 2 GHz, 2 sockets, 4

cores/socket, 4 GB DRAM |

1 TB disk, 115 MB/s sustained

transfer |

| Cost in 2008 | $3600/mo. | $1000 | $100 |

| $1 buys in 2008 | 2.7 GB | 128 CPU hours | 10 GB |

| cost/performance improvement | 2.7x | 16x |10x

| Cost to rent | $1 | $0.27–$0.40 | $2.56 | $1.20–$1.50 |

4）值得注意的是，Amazon EC2并没有停运过，我想这会不会意味着EC2用户相比S3还是少？大多数付费用户还是用云存储？ EC2用的认证服务与S3不同？相比较公司内部IT的故障，这个故障时间还是可以忍受的。

Table 7: Outages in AWS, AppEngine, and Gmail

Service and Outage Duration Date

S3 outage: authentication service overload leading to unavailability [39] 2 hours 2/15/08

S3 outage: Single bit error leading to gossip protocol blowup. [41] 6-8 hours 7/20/08

AppEngine partial outage: programming error [43] 5 hours 6/17/08

Gmail: site unavailable due to outage in contacts system [29] 1.5 hours 8/11/08

5）文中还讨论个老问题，皮鞋传输和网络传输的比较。就是用快递，可以达到1500 Mbit/sec的速度，而用网络，却是20 Mbit/sec。因此如果发展云计算，估计数据中心可以养活一大批快递。

BTW, how to create a table here?

Reference links:

http://www.hpcwire.com/features/Berkeley-Releases-Cloud-Computing-Study-39502692.html?page=1

http://www.amazon.com/Computer-Systems-Programmers-Randal-Bryant/dp/013034074X/ref=sr_1_1?ie=UTF8&s=books&qid=1235294059&sr=1-1

2009年4月20日星期一

VGA是鸡肋？14款电视PC连接横向评测!

CNET中国·ZOL 08年03月18日【原创】作者：
中关村在线沈研责任编辑：

今天我们要讨论一个事实，随着平板电视价格逐步的下降，许多朋友买电视已经不光是为了收看电视节目这么简单了。对于更多的玩家、IT、硬件爱好者来说，购买电视的主要目地在于游戏娱乐，例如连接电脑上网、玩游戏等，这样一来平板电视就成了一台大尺寸的液晶显示器。

平板电视比显示器发展更迅速，其数字接口提前进化到了HDMI，而目前绝大多数显示器点标准配置还停留在DVI接口。而且PC的显卡提供HDMI接口的也是少数，因此，平板电视也同样提供了更加普遍的D-SUB接口，通过VGA线来连接（提醒：最好买带磁环的好VGA线）。

VGA是鸡肋？15款电视PC连接横向评测!

有了D-SUB接口，基本上就能够满足所有朋友用电脑连接电视的需求了。但是这里还有一个玄机。提供了D-SUB接口的电视也不一定能够享受到等同于显示器的显示效果。由于电视所使用的液晶面板和显示器不同，所以在驱动IC
和接口芯片上都需要对PC分辨率进行特别优化才能够实现点对点的全屏幕显示，而据笔者调查，目前市面上绝大多数产品都没有对VGA接口进行过优化设置，这
样一来用电脑连接电视虽然能够显示出画面，但却无法开启电视真正的分辨率，或则出现严重的过扫描现象，导致无法点对点显示，如此就会产生比例不依，惨不忍
睹的粗糙画面。

● 总体测试结果报告

本次14款测试机型中，只有一款1366×768分辨率的机型，其余产品一律为1920×1080分辨率，本次测试结果并不理想，下面我们来看看测试结果。

产品名称	VGA接口PC连接分辨率	是否通过测试？
LG 47LB5RE	1920×1080(点对点）	是
TCL 42E77F	1360×768	否
创维52L16HF	1920×1080(过扫描、偏色)	否
东芝40XF300C	1360×768	否
东芝42C3000C	1360×768	否
海信TLM46V68	1920×1080(过扫描)	否
日立L42X101	1280×1024	否
三星 46N81B	1920×1080(点对点）	是
厦华42R35	1600×1200	否
松下37LX7D（1366×786）	1360×768（点对点）	是
索尼 46W300A	1920×1080(点对点）	是
夏普52RX1	1400×1050	否
优派 4250p	1680×1050	否
长虹52700FHD	1280×1024	否
最后成绩	4款产品通过	10款产品失败

● 总结

在显示器领域，液晶显示器无论是VGA接口还是DVI接口，均必须达到点对点显示。而在平板电视行业，由于电视多数用途都体现不到点对点显示的画面，所以许多厂商便没有刻意的去对VGA接口进行严格要求，导致了用VGA接口连接电脑无法识别出电视分辨率的问题，造成许多游戏爱好者或者发烧友的不便。

通过本次市场调查评测，刻意看到目前对VGA接口要求严格的品牌只有四家，其余产品均没有通过测试，对PC连接方面有要求的朋友可以做一下参考。

我来说几句：D-SUB接口（就是通常我们说的VGA接口）支持1920*1080分辨率或者更高的绝对没有问题，到QSXGA的2560*2048都没有问题，根本不存在某些人所说的带宽问题。电脑或笔记本的显卡也支持1080P，你能找到的不支持1920*1080分辨率的显卡基本是古董级显卡，我的nVidia RIVA TNT2（16M显存）都支持，估计没有谁比我的显卡差。既然和VGA接口没有关系，和电脑也没有关系，那问题肯定就出现在电视上，由于液晶电视不只是需要液晶屏能够的物理分辨率达到1920*1080 ，其驱动IC
和接口芯片上都需要对对于VGA接口特别优化才能够实现点对点的全屏幕显示。从上面的测试结果来看，没有一款国内品牌的液晶电视的VGA接口能够完美支持1080P，可以在选购时看看电视说明书，VGA一般只到1024*768/60Hz。另外，据说用HDMI的话，有相当一部分液晶电视会出现无法点对点显示的情况(具体表现多为画面整体偏移、压缩等)，购买时最好实际测试。

Specifications
Graphics controller	nVidia RIVA TNT2
Memory	32 MB SGRAM
RAMDAC	Built-in 300 MHz
Bus type	AGP (4x/2x/1x) with full sideband/Execute mode support
VGA connector	DB-15 analogue monitor connector VESA DDC2B, DPMS, VBE 2.0/3.0
TV output	S-Video Composite (RCA)
LCD output - DFP connector (Optional)	DDWG (Digital Display Working Group) compliant Digital LCD Flat Panel output for up to 1280 × 1024 resolution
Vertical frequency	60 Hz - 240 Hz
Horizontal frequency	172.8 K
Fill rate (pixels/sec)	250 M
Triangles/second	5 M
Bandwidth	2.4 GB/s

Resolution, colour and refresh rates supported
Resolution	Colour	Refresh rate
640 × 480	256, 65 K, 16 M	60 - 240
800 × 600	256, 65 K, 16 M	60 - 240
1024 × 768	256, 65 K, 16 M	60 - 200
1152 × 864	256, 65 K	60 - 170
1152 × 864	16 M	60 - 150
1280 × 1024	256, 65 K	60 - 150
1280 × 1024	16 M	60 - 120
1600 × 1200	256, 65 K	60 - 100
1600 × 1200	16 M	60 - 85
1920 × 1080	256, 65 K	60 - 85
1920 × 1080	16 M	60 - 85
1920 × 1200	256, 65 K	60 - 85
1920 × 1200	16 M	60 - 75
2048 × 1536	256, 65 K	60 - 75

标准编号:SJ/T 11344-2006

标准名称:数字电视液晶背投影显示器测量方法

标准状态:现行

实施日期:2006-3-29

颁布部门:中华人民共和国信息产业部

内容简介:本标准规定了标准清晰度电视（SDTV）、高清晰度电视（HDTV）数字电视液晶背投影显示器（以下简称显示器）的测量条件和测量方
法，适用于高清晰度、标准清晰度液晶（LCD）背投影显示器、硅基液晶（LCoS）背投影显示器、数字光学处理（DLP）背投影显示器等固定分辨力背投影
显示器。

出处: http://www.csres.com/detail/173205.html

下载:http://www.csres.com/upload/qy%2Fin%2FSJT11344-2006.PDF

标准编号:SJ/T 11343-2006

标准名称:数字电视液晶显示器通用规范

标准状态:现行

实施日期:2007-1-1

颁布部门:中华人民共和国信息产业部

内容简介:本规范规定了数字电视液晶显示器（以下简称LCD显示器）功能和性能、检验方法、检验规则、标志、包装、运输、贮存等的通用要求。本规
范适用于数字电视液晶显示器，是产品设计、生产定型和检验的主要依据。对于兼容接收符合GB
3174-1995规定的液晶电视广播接收机和54cm以下的液晶显示器也可参照使用。

出处: http://www.csres.com/detail/173204.html

下载:http://www.csres.com/upload/qy%2Fin%2FSJT11343-20061.pdf

标准编号:SJ/T 11338-2006

标准名称:数字电视液晶背投影显示器通用规范

标准状态:现行

实施日期:2007-1-1

内容简介:本规范规定了数字电视液晶背投影显示器（以下简称“LCD背投”）功能和性能，检验规则、标志、包装、运输、贮存等的通用要求。本规范
适用于数字电视液晶背投影显示器，是产品的设计、生产定型、检验的主要依据。对于兼容接收符合GB
3174-1995规定的液晶背投影电视广播接收机也可参照使用。

出处: http://www.csres.com/detail/173199.html

2009年4月13日星期一

OpenCAP安装配置

Download and install OpenXCAP

The software has the following dependencies, which you must install on your
operating system:

LibXML (http://xmlsoft.org/downloads.html)
Python 2.5 or newer (http://www.python.org)
Twisted framework (http://twistedmatrix.com)
Twisted Core, Twisted Web and Twisted Web 2 (http://twistedmatrix.com)
python-lxml (http://codespeak.net/lxml)
python-application (http://pypi.python.org/pypi/python-application)
python-mysqldb
python-xml (_xmlplus package provided by pyxml library)
GnuTLS library (http://www.gnu.org/software/gnutls)
python-gnutls (http://pypi.python.org/pypi/python-gnutls)

另外，上面没有提到需要安装Zope

http://www.zope.org/Products/

注意事项：

Zope对于Python有特殊的版本依赖关系，不能太早，也不能太新；例如

         Zope 2.11/2.10 requires Python 2.4.5 (Python 2.4.4 is still acceptable). Older Python
versions are no longer supported. Python 2.5 is not supported
at this time.
   Zope-2.11.2-final.gz <-- Python-2.4.6.tar.gz
             Zope-2.10.7-final.gz <--
          这就是出现 error: package directory 'build/lib/linux-i686-2/4/zope/app' does not exist 的原因

例如RHEL5.3自带的版本是Python2.4.3,所以不行。要先下载Python编译安装后，再重新编译Zope。因此，上面所说的需要Python 2.5 or newer是错误的。
另外

./configure --prefix=/usr/local/zope2 --with-python=/usr/local/bin/python2.4
make
make install

2009年4月10日星期五

新闻也开始植入式广告

新闻也开始植入式广告

感觉中国人民太有才了。06年还《天下无贼》，一本正经的介绍什么叫“植入式广告”，免费为全国人民做科普知识，如今已经是普天盖地，近乎离谱，超越人民群众的心里底线了。如下：
==============================

专家揭秘IC卡安全危机：破解最慢仅需三小时

http://tech.QQ.com　
2009年04月10日17:01 　腾讯科技　孟鸿　

腾讯科技讯
4月10日消息，今日北京晨报一则IC卡安全算法已遭破解的报道引发网友强烈关注，腾讯科技连线汉王科技的专家石践先生，对这场IC卡引发的安全危机进行独家解读。

近日，工信部发文要求各地开展对IC卡使用情况的调查及应对工作，其北京市主要应用于IC卡系统的MI芯片的安全算法已遭到破解，全国约1.4亿张应用此技术的IC卡都将面临巨大的安全隐患。

据石践介绍，上述所谓MI芯片指的就是恩智浦半导体Mifare经典芯片。2006年，飞利浦将半导体业务部门拆分出来，成立恩智浦半导体公司。

“采用MI芯片的IC卡全球有10亿张，在中国约有1.4-1.5亿张”，来自石践的数据显示MI芯片在全球IC市场的占有率在八成左右。

破解IC卡最慢仅需三小时

尽管IC卡危机近日才广泛见诸报道，但实际上MI芯片遭破解已经是2008年的事情了。去年8月，Mifare的破解软件及硬件已在网络上公开售卖，售价仅为500美金。

“实际这已是公开的东西”，石践告诉腾讯科技对于这套破解的方法，只要是有些基础的或者业内人员就可以很快搞明白。

石践说这种破解方法的基本原理近似于“穷举法”，即便用一般的计算机，三个小时也足以完成分析和破解的整个过程。IC卡一旦被破解，就可以被肆无忌惮的复制了。

据悉，生产IC的是一种标准的机器，个人就可以买到，并不需要非常复杂的手续。

只要获得相应的机器，就可以复制被破解的IC卡。石践表示大多数IC卡系统在使用规则方面并没有进行更多限制，被复制的IC卡完全可以畅通无阻的使用。

可升级为人脸识别系统应对

“威胁主要在公交卡、门禁卡”等方面，石践说广州已经出现私下复制、更改IC卡内容，为公交卡进行充值的现象了，“这个影响非常恐怖”。

相对于公交卡，门禁卡等领域遭遇的威胁更为严峻。“公交卡被复制对个人的影响相对较小”，石践指出一旦门禁卡等安全级别更高的IC卡被复制，后果可想而知。

而解决的办法目前只能是等待系统升级。石践表示对于安全系数要求更高的单位，可以考虑采用生物特征进行加密的系统。

“比方升级为人脸识别系统，或者在卡片里面放入更多的个人数据和信息”，石践说采用上述更高级的技术之后，就算卡丢了或被复制了，也不会出现被恶意使用等安全隐患。

实际上汉王已在3月推出人脸识别产品，而这正是石践领导开发的。据透露，人脸识别技术算是指纹识别技术的一种升级，目前已经达到一秒识别15次的水平。(文/孟鸿)

==================================

专家揭秘IC卡安全危机：破解最慢仅需三小时

http://tech.QQ.com　
2009年04月10日17:01 　腾讯科技　孟鸿　我要评论(0)

据石践介绍，上述所谓MI芯片指的就是恩智浦半导体Mifare经典芯片。2006年，飞利浦将半导体业务部门拆分出来，成立恩智浦半导体公司。

“采用MI芯片的IC卡全球有10亿张，在中国约有1.4-1.5亿张”，来自石践的数据显示MI芯片在全球IC市场的占有率在八成左右。

破解IC卡最慢仅需三小时

“实际这已是公开的东西”，石践告诉腾讯科技对于这套破解的方法，只要是有些基础的或者业内人员就可以很快搞明白。

据悉，生产IC的是一种标准的机器，个人就可以买到，并不需要非常复杂的手续。

可升级为人脸识别系统应对

“威胁主要在公交卡、门禁卡”等方面，石践说广州已经出现私下复制、更改IC卡内容，为公交卡进行充值的现象了，“这个影响非常恐怖”。

而解决的办法目前只能是等待系统升级。石践表示对于安全系数要求更高的单位，可以考虑采用生物特征进行加密的系统。

内地1.4亿IC卡安全有隐患：掌握技术可修改钱数

2009年04月10日 02:54北京晨报【大中小】【打印】已有评论0条

日前，工业和信息部发布了《关于做好应对部分IC卡出现严重安全漏洞工作的通知》，要求各地各机关和部门开展对IC卡使用情况的调查及应对工作。工信部的这则通知的背景是什么呢？北京晨报记者从专业人士处获知了惊人的消息:主要应用于IC卡系统的MI芯片的安全算法已遭到破解！目前全国170个城市的约1.4亿张应用此技术的IC卡也都将面临巨大的安全隐患。

事件：命门被破

IC卡

专家不公布破解方法

卡的加密与解密永远都是一枚硬币的两面，每天在世界的各个地方，都有无数的科学家、学者和黑客们不停地在研究着各种加密和解密的技术与技巧，成功与失败也在永远上演着。2008年，德国研究员亨里克·普洛茨和美国弗吉尼亚大学计算机科学在读博士卡尔斯滕·诺尔就享受到了成功的喜悦:他们最先利用电脑成功破解了恩智浦半导体的Mifare经典芯片(简称MI芯片)的安全算法。但对于这项科研成果，亨里克·普洛茨和卡尔斯滕·诺尔却在第一时间宣布绝不公布其破解的成果。那么，他们为何对自己的科研成果讳莫如深呢？原来，他们所破解的MI芯片的安全算法，正是目前全世界应用最广泛的非接触IC卡的安全算法！可以想见，如果这一科研成果被人恶意利用，那么大多数门禁系统都将失去存在的意义，而其他应用此种技术的IC卡，如各高校学生使用的校园一卡通、高速公路缴费一卡通等也都将面临巨大的安全隐患。

掌握技术可修改钱数

看过美国大片的人都有印象，片中的主角可以利用电脑轻易复制各种卡，随意出入机关重重的机密场所。MI芯片的安全算法如果被不法分子利用，现实生活中，影片主角的行为可能就发生在你身边。

中国信息产业商会智能卡专业委员会理事长潘利华教授对记者表示，“Mifare非接触逻辑加密卡的安全性目前确实存在一定问题，这对目前广泛应用的城市IC卡项目是一个不小的隐患。”据潘教授介绍，一旦不法分子攻破这一技术，“就可以随意修改自己卡内的信息，比如原来充值10元，他就可以随意改成1000元或者10000元，甚至更多。而且一旦复制了别人的卡片，也可以随意修改信息。这对广大持卡人来说，无疑是极大的安全隐患。”

影响范围涉及公共领域

据建设部IC卡应用服务中心常务副主任马虹介绍，“截至2008年11月，全国共有170余个城市在公共交通领域建立了不同规模的IC卡应用系统，比2007年增长30%，90%的城市实现了城市IC卡的一卡多用。”值得注意的是，目前，这部分城市均采用逻辑加密卡。Mifare算法遭破解后，引发业界高度关注。潘利华教授认为，卡片信息一旦遭到大规模复制，可能给城市公共事业造成极大安全隐患。所以目前，如何解决这个安全漏洞，是业内和相关部门亟待思考及解决的问题。

目前城市IC卡正在向“一卡多用”方向发展。在北京，IC卡已扩展到了超市消费领域，“一旦有不法分子复制卡片后恶意充值倒卖或自己消费，后果将不堪设想。”潘教授也对此表示了自己的担忧。

应对：等待升级

IC卡

政府推广CPU卡

据了解，截至目前，我国170余个城市应用了不同规模的公用事业IC卡系统，发卡量已超过1.5亿张，约有95%的城市在应用IC卡系统时选择使用非接触式逻辑加密卡，相当于约有1.4亿张逻辑加密卡在我国城市公用事业IC卡系统中应用。其应用范围已覆盖公交、地铁、出租、轮渡、自来水、燃气、风景园林及小额消费等领域。Mifare算法遭破解后给我国城市公用事业IC卡系统安全性造成了极大隐患。卡片信息一旦遭到大规模复制，IC卡系统将会面临巨大的威胁。如何从根本上解决这个安全漏洞，防止危害的发生，是目前行业亟待思考及解决的重要问题。

“有效防范Mifare算法破解的根本解决方案就是升级改造现有IC卡系统，并逐步将逻辑加密卡替换为CPU卡。”据潘教授介绍。相比逻辑加密卡，虽然CPU卡出现的时间较晚，但因为CPU卡拥有独立的CPU处理器和芯片操作系统，可以更灵活的支持各种不同的应用需求，交易也更安全。“其生命力也更旺盛，优势主要体现在几个方面:数据安全性更高、应用更加灵活、应用能力更多。”握奇公司的周斌告诉记者。“目前，欧洲各国的银行卡都是采用CPU技术，至今未出现被破解和攻击等恶性事件，充分说明了CPU卡的优势所在。”潘教授告诉记者。

据潘教授介绍，升级后的新版牡丹交通卡是一张集芯片和磁条于一身的银联标准贷记卡，从卡种上来说，就是典型的接触式CPU卡，安全性也大大提高。从外观上来说，接触式CPU卡与非接触式CPU卡区别就是表面多一个芯片，而非接触式CPU卡与目前的逻辑加密卡看不出区别。

为改变目前逻辑加密卡的安全隐患现状，大力推进CPU卡应用就显得势在必行。而国家相关部门也做了大量工作，建设部IC卡应用服务中心在2008年就制订了两项国家行业标准，同时向建设部申报三项城市互联互通卡系列标准，为CPU卡的广泛应用及实现全国互联互通建立标准体系。

记者了解到，目前包括北京在内的全国各大城市都在稳步推进向CPU卡的过渡工作，一些新上项目都被要求采用CPU卡系统，而放弃使用目前的逻辑加密卡。业内一位知情人透露，山东淄博的城市一卡通项目就是一个例子，“原本打算50万张逻辑加密卡的招标工作都已结束，但随后还是决定从长远利益出发，重新招标使用CPU卡。”

专家释疑

IC卡

IC公交卡北京最安全

“安全算法可以被攻破，但只要从系统层面考虑周密，同时注重系统的安全建设，建设时留有足够的可升级和可拓展空间，那么卡片算法破解后出现的一系列安全隐患都将得到有效遏制。这就如同电脑遭到攻击后，我们需要下载补丁升级一样。对于大众而言，没有必要过于担心，更没有必要弃卡不用。”有关专家也表示。

日前在接受北京晨报记者采访时，北京市政交通一卡通公司总经理汪连启表示，目前，香港、上海等城市大部分采用的是S50卡，而北京现在已全部置换成S70卡，拥有48种功能，在全国的IC卡里，北京交通一卡通的安全等级是最高的。而根据国家有关部门的要求，北京正在逐步启用CPU卡，逐渐取代现在的S70逻辑加密卡。汪连启告诉北京晨报记者:“目前市民在一卡通售卡点买到的卡都已经是CPU卡。而原先市民手中的S70卡仍旧可以使用，不会要求市民在较短时间内换卡。”汪总同时表示，“市民手头的S70卡也没有必要去更换，因为S70卡目前在全国还是最先进的，安全性能也是最高的。”

=====================================

解读：

美国弗吉尼亚大学计算机科学在读博士卡尔斯滕·诺尔 Karsten Nohl

主页：http://www.cs.virginia.edu/~kn5f/

2009年4月1日星期三

dist-sys-susx.ac.uk

source: http://www.cogs.susx.ac.uk/courses/dist-sys/node1.html

2.1 Introduction

Main Points

What we are going to cover
What distributed systems are
What characteristics well-designed systems have

2.1.1 Course Outline

What are we going to do?

1 lectures on computer communications and networks and loosely
coupled systems, revising material from the Multimedia
Communications Technology course. The Postgraduate course will get
an additional two hours in their seminar slot.
5 lectures on Remote Procedure Call, distributed objects and
classic distributed systems such as NFS
3 lectures on integrating distributed systems through the web,
content distribution networks and Peer to Peer computing.
5 lectures on closely coupled systems for distributed
transactions
1 lecture on recent advances in networks and distributed
systems

2.1.2 What's a Distributed System

A distributed system: physically separate computers working
together

Cheaper and easier to build lots of simple computers
Easier to add power incrementally
Machines may necessarily be remote, but system should work together
Higher availability - one computer crashes, others carry on
working
Better reliability - store data in multiple locations
More security - each piece easier to secure to right level.

2.1.2.1 The real world...

In real life, can get:

Worse availability - every machine must be up. ``A distributed
system is one where some machine you've never heard of fails and
prevents you from working''
Worse reliability
Worse security

Problem: Coordination is more difficult because multiple
people involved, and communication is over network.

Your Task: What are the distributed interactions when you
login at an x-terminal?

2.1.2.2 Interactions at an X terminal

Simplified interactions

2.1.3 Example Distributed Systems

Electronic MailMail delivered to remote mailbox. Requires
global name space to identify users, transport mechanisms to get
mail to mailbox
Distributed Information - WWWRemote information hidden below
hypertext browser. Caching and other features operate transparently
Distributed File SystemFiles stored on many machines,
generally not machine you're working on. Files accessed
transparently by OS knowing they're remote and doing remote
operations on them such as read and write e.g. Network File System
(NFS)
Trading Floor SystemBids made, stocks sold, screens updated.

(Source:The Eight Fallacies of Distributed Computing -
Peter Deutsch

2.1.4 What do we want from a Distributed System?

Resource Sharing
Openness
Concurrency
Scalability
Fault Tolerance
Transparency

Your Task: order the importance of each of these features for the
example systems in the previous slide.

2.1.5 Elements of a Distributed System

Another way to view a system...

Communications system
Messages
Machines
Processes on Machines
Programs
People

2.1.6 Conclusion

Its difficult to design a good distributed system: there are a
lot of problems in getting ``good'' characteristics, not the least of
which is people
Over the next ten weeks you will gain some insight into how to
design a good distributed system.

2.2 Bits and Bytes

Main Points

Bits and Bytes
Kilo, Mega, Giga, Tera
Memory, Packets and Bit Manipulation

2.2.1 Bits, Bytes, Integers etc

All data in computers is held as a sequence of ones and zeros.
A number is represented as the base 2 number held in some
pre-ordained fixed number of bits
Almost all other data is represented as various sets of
numbers.
A byte is 8 bits sequenced together - what is the
maximum number?
Integers (in Java and most other languages nowadays)
are 32 bits long - what is the maximum number?

2.2.2 Prefices

In communications we talk often about throughput in
bits/second, or moving files of some particular size. We use
magnitude prefices for convenience
kilo1000 x

mega1000000 x

giga
10⁹ x

tera
10¹² x
There is often confusion as to whether a kilobyte is 1000 or
1024 bytes. When dealing with processor architectures, its
generally 1024. When dealing with communications, its
generally 1000. State assumptions if it is not obvious from
context.

2.2.3 Memory and Packets

A computer stores data as bits in memory.
When it wants to send this data to another computer, it copies
the bits into the memory of the communications device.
The communications device sends the bits through the network
to the other machine (we'll cover the details of this in the
coming week).
The other machine's communication device places the bits
into memory which the other machine can access.
The tricky bits come in ensuring that both machines
interpret the bits correctly.

2.2.4 Bit Manipulation

Not only must the data be sent, but accompnying information
allowing the computers to interpret the context of the data.
Communications software must be able to pick out arbitrary bits
from an opaque bit sequence and interpret their meaning.
We do this using bitwise operations - and, or, exclusive or,
negation.

2.3 Foundations of Distributed Systems

Aim: How do we send messages between computers across a network?
Physical Concepts:Bandwidth, Latency
Packet Concepts:Data, Headers
Routing Concepts:Shared Media, Switches and Routing Tables,
Routing Protocols

For more detailed notes, see Dan Chalmers' Multimedia Communications Technology notes in http://www.informatics.sussex.ac.uk/courses/mct/.

2.3.1 Physical Concepts

What is a message?

A piece of information which needs to move from one process to
each other eg a request to open a file, an email message, the
results of a remote print request.

For the network, this is a just a sequence of bits in memory.

Need to communicate this sequence of bits across communications
system to other host.

Signal to other side whether each bit in sequence is
0 or 1.

To communicate, need each end to each access a common substrate, and
then for the sender to change the value of a physical characteristic
of the substrate.

Examples - voltage level of a piece of wire, light level in optical
fibre, frequency of radio wave in air

2.3.1.1 Signal characteristics

If physical characteristic has two levels then signal is binary.

If it has more than one level, can encode a number of bits to each
level.

If 4 levels, then level 0 = 00, level 1 = 01, level 2 = 10, level 3 =
11

If 8 levels, how many bits can be encoded?

2.3.1.2 Bandwidth

Adjusting the level at one end, and recognising the level has changed
at the other takes finite time.

The finite time limits the speed at which bits can be signalled.

The rate of signalling is the bandwidth, measured in bits per second.

If it takes 20 microseconds to raise the voltage on a wire, and 10
milliseconds to recognise the new level, what is the possible
bandwidth if there are two levels? Eight levels?

2.3.1.3 Noise and errors

Can we get infinite bandwidth by increasing the number of levels?

No, since noise makes it impossible to correctly distinguish level.

Noise is random changes in the physical characteristic to which all
physical phenomena are prone.

Always some probability that level will be misinterpreted.

Always some probability of error in message.

Goal of communications engineers is to make this probabilty as low as
necessary.

2.3.1.4 Latency

Does signal propagate along wire infinitely fast?

No, limit to speed of propagation of light. Hence described
as propagation delay.

Latency is time taken for signal to travel from one end of
communication system to destination.

Since communication system may reconstruct message at intermediate
points, time taken to reconstruct message is also part of latency.
Known as switching delay

Your Task: Describe the bandwidth, propagation delay and
switching delay in a game of chinese whispers.

2.3.2 Packets

If there is an error in a message, how is it detected and rectified?

Compute a checksum over the message
send the checksum with the message
Calculate a new checksum over the received message
Compare the checksums - if different, then message is in error
Ask for the message to be resent

Probability of error rises with length of message,

Thus message sent in separate lumps with maximum number of bits
per lump, known as packets.

If the message fits in one packet, good. Otherwise message is in many
packets.

2.3.2.1 Addressing

How do we direct the packet to correct recipient(s)?

Put header bits on front of packet, analogous to address and
other information on envelope in postal system. Add source of
packet to allow returns
Destination | Source | Packet body
If destination and source share the same physical medium, then
destination can listen for its own address (ethernet, wireless,
single point to point link)
If they don't share the same LAN, we use routing

2.3.2.2 Names, addresses and routes

NameIdentifier of object eg Ian Wakeman, Mum
AddressLocation of object eg Rm 4C6, COGS
RouteHow to get there from here - ``turn left, first right and
up stairs, first door on left''

Internet: Name is a Domain Name, such as www.cogs.susx.ac.uk. More
later.

To get packet through network, turn name into address by asking Domain
Name Service (DNS).

Place address in packet header and hand to nearest router.

Router locates MAC address corresponding to IP address, if they're on
the same LAN.

2.3.2.3 Indirect Addressing

$\includegraphics{figs/indirect-addresses}$

2.3.2.4 Architecture of a switch

$\includegraphics{figs/switch}$

A switch is a specialised computer. Can turn a PC into a router,
using free software.

A packet arrives on a link
The switch gets the destination of the packet from the packet
header
The switch looks up the destination in a routing table in
memory and discovers the output link
The switch queues the packet to be sent on the output link

2.3.2.5 Distributed Routing Table Maintenance

The situation:

People are network administrators (and end users)
Communication systems are links (possibly multipoint)
Machines are switches
Messages may be lost

The problem: Given an address, how does a router know which output
link to send the packet on?

Choices:

Packet could contain list of all output links - source routing.
Requires source to lookup and determine route. May be good thing.
Router could look up address in local routing table, and send out of
corresponding link. If not in table, send out default link.

How do we construct tables? Could install entries by hand, known as
static routing.

But

Limited to entries people know about.
Unable to ensure consistency and absence of loops
Unable to respond to changing topology, sharks gnawing through undersea
cables etc.
Internet has no centralised authority

So we use distributed routing algorithm

2.3.2.6 Distance Vector Routing

Each switch knows its own address
Each link has''cost'', such as a value of 1 per link, or measure
of delay.
Each switch starts out with distance vector, consisting of 0 for
itself, and infinity for everyone else
Switches exchange distance vectors with neighbour switches, and
whenever info changes
Switch saves most recent vector from neighbours
Switch calculates own distance vector by examining cost from
neighbour and adding cost of link to neighbour
Use link with minimum cost to destination as link to route out.

Examples include RIP and BGP.

2.3.2.7 Link State Routing

Each switch knows addresses that are direct neighbours
Switch constructs packets saying who are neighbours - link
state packets.
Link state packets flooded to all other switches
Switch constructs complete graph using most recent link state
packets from all other switches
Use Dijkstra shortest path to figure out routing table.

Examples include OSPF.

2.3.3 Conclusion: Network Properties

Packet Switched Networks present certain fundamental problems to the distributed systems programmer:

Whilst switches converge on a consistent set of routes, packets
will bounce around in the network, and suffer delays or get dropped.
Switches shared with other packets, therefore get queues.
Total latency is therefore variable.
Loss rate is variable (noise, available switch buffers).
Queue size management (aka congestion control) changes the bandwidth available to machines. Therefore bandwidth is variable.

2.8 Computer Security: Why you should never trust a
computer system

Goal: Prevent Misuse of computers

Definitions
Authentication
Private and Public Key Encryption
Access Control and Capabilities
Enforcement of security policies
Examples of Security Problems

2.8.1 Definitions

Types of Misuse

Accidental
Intentional

Protection is to prevent either accidental or intentional misuse

Security is to prevent intentional misuse

Three pieces to security

AuthenticationWho user is
AuthorisationWho is allowed to do what
EnforcementEnsure that people only do what they are allowed to do

A loophole in any of these can cause problem eg

Log in as super-user
Log in as anyone, do anything
Can you trust software to make decisions about 1 and 2?

2.8.2 Authentication

Common approach: Passwords. Shared secret between two parties. Since
only I know password, machine can assume it is me.

Problem 1 system must keep copy of secret, to check against
password. What if malicious user gains access to this list of
passwords?

Encryption Transformation on data that is difficult to
reverse - in particular, secure digest functions.

2.8.2.1 Secure Digest Functions

A secure digest function h = H(M) has the following properties:

Given M, is is easy to compute h.
Given h, it is hard to compute M.
Given M, it is hard to compute M' such that
H(M) = H(M')

For example: Unix /etc/passwd file

Password
$\rightarrow$ one way transform
$\rightarrow$ encrypted
password

System stores only encrypted version, so ok if someone reads the file.
When you type in password, system encrypts password and compares
against the stored encrypted versions.

Over the years, password protection has evolved from DES, using a
well-known string as the input data, and the password as the key,
through to MD5 to SHA-1.

2.8.2.2 Passwords as Human Factors Problem

Passwords must be long and obscure.

Paradox: short passwords are easy to crack, but long
ones, people write down

Improving technology means we have to use longer passwords. Consider
that unix initially required only lowercase 5 letter passwords

How long for an exhaustive search?
26⁵ = 10, 000, 000

In 1975, 10 ms to check a password
$\rightarrow$ 1 day
In 2003, 0.00001 ms to check a password
$\rightarrow$ 0.1 second

Most people choose even simpler passwords such as English words - it takes
even less time to check for all words in a dictionary.

2.8.2.3 Some solutions

Extend everyone's password with a unique number (stored in
passwd file). Can't crack multiple passwords at a time.
Require more complex passwords, eg 7 letter with lower, upper,
number and special
70⁷ $\approx$ 8000 billion, or 1 day. But
people pick common patterns eg 6 lower case plus number.
Make it take a long time to check each password. For example,
delay every login attempt by 1 second.
Assign long passwords. Give everyone a calculator or smart card
to carry around to remember password, with PIN to activate. Need
physical theft to steal card.

Problem 3 Can you trust the encryption algorithm? Recent
example: techniques thought to be safe such as DES, have back doors
(intentional or unintentional). If there is a back door, you don't
need to do complete exhaustive search.

2.8.3 Authentication in distributed systems: Private Key Encryption

In distributed systems, the network between the machine on which the
password is typed and the machine the password is authenticating on is
accessible to everyone.

Two roles for encryption:

Authentication - do we share the same secret?
Secrecy - I don't want anyone to know this data (eg medical
records)

Use an encryption algorithm that can easily be reversed given the
correct key, and difficult to reverse without the key

$\includegraphics{figs/private}$

From cipher text, can't decode without password.
From plain text and cipher text, can't derive password.
As long as the password stays secret, we get both secrecy and
authentication.

2.8.3.1 Symmetric Encryption

Symmetric encryptions use exclusive ors, addition, multiplication,
shifts and transpositions, all of which are fast on modern processors.
DESThe Data Encryption Standard (DES) was released in 1977.
The 56 bit key is too weak for most uses now, and instead, 3DES or
triple DES is used, which has 128 bit keys.
IDEAThe International Data Encryption Algorithm has 128 bit
keys, and has proven strong against a large body of analysis.
AESThe US based NIST has defined a new encryption algorithm
based on Rijndael, which offers 128, 192 or 256 bit keys.

2.8.3.2 Authentication Server - Kerberos Operation

The server keeps a list of passwords, provides a way for two parties,
A, B to talk to each other, as long as they trust the server.

Notation: Kxy is a key for talking between x and y. (..) $\wedge$ K means
encrypt message (..) with key K.

Simplistic overview:

A asks server for key:

A
$\rightarrow$ S (Hi! I'd like a key for A,B)
Server gives back special ``session'' key encrypted in A's key, along
with ticket to give to B:

S
$\rightarrow$ A (Use Kab (This is A! Use Kab) $\wedge$ Ksb) $\wedge$ Ksa
A gives B the ticket:

A
$\rightarrow$ B ((This is A! Use Kab) $\wedge$ Ksb

Details: add in timestamps to limit how long each key exists, use
single use authenticators so that clients are sure of server
identities, and to prevent machine replaying messages later.

Also have to include encrypted checksums to prevent malicious user
inserting garbage into message.

2.8.4 Public Key Encryption

Public key encryption is a much slower alternative to private key;
separates authentication from secrecy.

Each key is a pair K,K-1

With private key: (text) $\wedge$ K $\wedge$ K = text

With public key:

(text) $\wedge$ K $\wedge$ K-1 = text,
but (text) $\wedge$ K $\wedge$ K $\neq$ text
(text) $\wedge$ K-1 $\wedge$ K = text,
but (text) $\wedge$ K-1 $\wedge$ K-1 $\neq$ text

Can't derive K from K-1, or vice versa

2.8.4.1 Public key directory

Idea: K is kept secret, K-1 made public, such as public directory

For example: (I'm Ian) $\wedge$ K Everyone can read it, but only I
could have sent it (authentication).

(Hi!) $\wedge$ K-1 Anyone can send it but only I can read it
(secrecy).

((I'm Ian) $\wedge$ K Hi!) $\wedge$ K'-1

On first glance, only I can send it, only you can read it.
What's wrong with this assumption?

Problem: How do you trust the dictionary of public keys? Maybe
somebody lied to you in giving you a key.

2.8.5 Secure Socket Layer

Provides a techniques for data sent over a TCP connection to be
encrypted.
Uses public key technology to agree upon a key, then 3DES or
whatever to encrypt the session.
Data encrypted in blocks, optionally with compression first.
Used in http as https, and for telnet, ftp etc.

2.8.5.1 SSL handshake protocol

$\includegraphics{figs/ssl}$

2.8.6 Authorisation

Who can do what...

Access control matrix: formalisation of all the permissions in the
system

objects	file1	file2	file3	...
users
A	rw	r
B		rw
C			r
...

For example, one box represents C can read file3

Potentially huge number of users and objects, so impractical to store
all of these.

2.8.6.1 Approaches to authorisation

Access control list - store all permissions for all users with
each object

Still might be large number of users. Unix addresses by having rwx
for user group world. Recent systems provide way of specifying
groups of users and permissions for each group
Capability list - each process stores tickets for the objects it
can operate on.

2.8.6.2 Digital Rights Management

Digital content is produced by people earning a living, and they
wish to protect their investment
Digital Rights Management is the catchall term for the
technologies controlling the use of digital content.
Two main approaches:
Containmentin which the content is wrapped is an encrypted
shell, so that users have to prove their capability of
using the content through knowledge of the key.

Watermarkingwhere the content is marked so that devices know
that the data is protected.
The problem is how to enforce the checking of
capabilities and enforce the no circumvention requirement.

2.8.7 Enforcement

Enforcer is the program that checks passwords, access control lists,
etc

Any bug in enforcer means way for malicious user to gain ability to do
anything

In Unix, superuser has all the powers of the Unix kernel - can do
anything. Because of coarse-grained access control, lots of stuff has
to run as superuser to work. If bug in any program, you're in
trouble.

Paradox:

make enforcer as small as possible - easier to get correct, but
simple minded protection model (more programs have high privilege).
fancy protection - only minimal privilege, but hard to get
right.

2.8.8 Trusted Computing Platform

Question: How do we ensure no software transgresses digital
rights?

Answer: By only allowing approved software to have access to
the data.

The basic computer and operating system must be running
validated software.
There must be an untamperable part of the computer/OS that can
hold keys and hand over only to validated software.
The untamperable part of the computer is the Fritz chip, a
co-processor which holds a unique certificate that it is running
some validated code

2.9 Names and Naming Services

2.9.1 Main Points

Use of names
Structure of names
Name Services
Domain Name Service (DNS) - The Internet Name Service

Definitions:-

Names- what its called
Address- where it is
Route- how to get there

2.9.2 Why names?

Object Identificationa service or resource we want to use, eg
a filename, a telecommunications provider, a person
Allow SharingCommunicating processes can pass names and thus
share resources
Location IndependenceIf we separate the name from the address,
can migrate object transparently
SecurityIf large number of possible names, knowing the name of
the object means that it must explicitly have been passed. If
entire system constructed so that names are passed with
authorisation then knowing name means chain of trust to allow
access to object.

2.9.3 What does one do with names?

Use as arguments to functions, eg to call a service
Create names. If an object comes into creation it must be
named. Name should meet rules for system, eg be unique, therefore
naming authority (possibly object) must keep track of what names it
has allocated
Delete names. When an object disappears from the system, at
some point may wish to delete name and allow re-use.

2.9.4 What's a name?

Part of the design space for any distributed system is how to
construct the name.
Choice of design has ramifications for design of rest of system
and performance of service.

2.9.4.1 Unique Identifier

Unique Identifier (UID) aka flat names, primitive names.
No internal structure, just string of bits.
Only use is comparison against other UIDs eg for lookup in table
containing information on named object.
Provide location independence and uniformity
But Difficult to name different versions of the same
object eg Distributed systems edition 1 vs Distributed Systems
edition 2. Real objects have relationships to each other - useful
to reflect in naming practice
Difficult to discover address from name - need to search entire
name space

2.9.5 Partitioned names

Add partition structure to name to enable some function to be
more efficient, typically location and name allocation

$\includegraphics{figs/dns}$

Domain Name Service (DNS) name - unix.rn.informatics.scitech.sussex.ac.uk.
Each part of name comes from flat name space. Division of name space
and possible objects into smaller space. "uk" reduces objects to
those within the uk, "ac" to those administered by academic
networks, and so on.
When allocating names, simply add to lowest part of partition.
Low risk of collision, since small number of other objects, all
administered by same authority
When looking up name, can guess where information will reside.

2.9.6 Descriptive names

Necessary to have a unique name.
But useful to name objects in different ways, such as by service
they offer eg Postman Pat, John of Gwent, www.informatics.sussex.ac.uk.
Create name using attributes of object.
Note that objects can therefore have multiple names, not all of
which are unique
Choice of name structure depends on system. DNS chooses
partition according to administration of creation, with aliases to
allow naming by service eg ftp.informatics.sussex.ac.uk is also
doc-sun.rn.informatics.scitech.sussex.ac.uk

2.9.7 Object Location from name - broadcast

Ask all possible objects within the system if they respond to
that name. Can be efficient if network supports broadcast eg
Ethernet and other LANs.
Equivalent to distributing name table across all objects,
objects storing names referring to them.
Only want positive responses - all responses would generate a
lot of traffic.
Scaling problem when move into wide area.
- Greater number of hosts imply greater probability of failure
- Broadcasts consume higher proportion of bandwidth, made worse
  due to higher failures needing more location requests
Broadcast used only on small LAN based systems (and in initial
location of directory)

2.9.8 Location through database

Keep information about object in a table, indexed by a name. Location
is just another piece of information.
In DNS, table is stored as {name,attribute} pairing, in a resource record
If database centralised, then
- Whole system fails if database machine fails
- Database machine acts as bottleneck for performance of system as
  whole
- In wide area systems, authority should be shared amongst
  controlling organisations
So name information usually distributed.

2.9.9 Distributed Name Servers

Parts of the name table are distributed to different servers eg. in
Domain Name Service, servers are allocated portions of the name space
below certain domains such as the root, ac.uk, susx.ac.uk

Names can be partitioned between servers based on

algorithmic clusteringeg apply well-known hash function on name to map to server. May
result in server being remote from object. Only technique for UIDs
structural clusteringif name is structured, use structure to designate names to particular server, such as in DNS
attribute clusteringif names are constructed using attributes, then servers take responsibility for certain attributes.

2.9.10 Availability and performance

If a single server is responsible for name space, there is still single
point of failure, and a performance bottleneck.

Most systems therefore

Replicate name list to other servers. Also increases
performance for heavily accessed parts of name space eg secondary
servers in DNS
Cache information received by lookup. No need to repeat lookup
if asked for same information again. Increases performance.
Implemented in both client side and server (in recursive calls) in
DNS.

If information is cached, how do we know when its invalid? May
attempt to use inconsistent information.

2.9.11 Maintaining consistency for distributed name services

Alleviated by following features of some distributed systems

In most systems objects change slowly, so names live for a long
time, and are created infrequently
If address of an object is wrong, it causes an error. Address user
can recover if it assumes one of the possible problems is inconsistent
information.
Obsolete information can be fixed by addressed object leaving
redirection pointers. Equivalent to leaving a forwarding address to new home.

However, there are always systems which break these assumptions eg
highly dynamic distributed object system, creating lots of objects and
names and deleting lots of names.

2.9.12 Client and Name server interaction.

Hidden behind RPC interface.
Client knows of server to ask (either installed in file, or
through broadcast location).
Client calls with arguments of name and required attribute, eg
address. In DNS arguments are name and the type of requested
attribute.
Server will return result with either required attribute or
error message

2.9.12.1 Lookup Modes

If name not stored on server, may be stored on other server. Two options

Recursiveserver asks other possible server about name and
attribute, which may then have to ask another server and so on

$\includegraphics{figs/recursive}$

Iterativeserver returns address of other possible server to client, who then resends request to new server.

$\includegraphics{figs/iterative}$

2.9.13 Summary

Names can be flat, or they can be structured
Centralised name servers suffer from availability - distributed
name servers suffer from inconsistency
Interaction with name server best modelled by RPC

2.10 Distributed File Systems

2.10.1 Main Points

A Distributed File System provides transparent access to files
stored on a remote disk

Themes:

Failures - what happens when server crashes but client doesn't?
or vice versa
Performance
$\Rightarrow$ Caching - use caching at both the
clients and the server to improve performance
Cache Coherence - how do we make sure each client sees most up
to date copy of file?

2.10.2 Client Implementation

Request for operation on file goes to OS.
OS recognises file is remote and constructs RPC to remote file server
Server receives call, does operation and returns results.
Subtrees in directory structure generally map to file system.
Provides access transparency

$\includegraphics{figs/remote-dir}$

2.10.3 No Caching

Simple approach: Use RPC to forward each file system request to remote
server (older versions of Novell Netware).
Example operations: open, seek, read, write, close
Server implements operations as it would for local request and passes
result back to client

2.10.3.1 Advantages and Disadvantages of uncached remote file
service

Advantageserver provides consistent view of file system to both A
and B.
Disadvantagecan be lousy performance

Going over network is slower than going through memory
Lots of network traffic - congestion
Server can be a bottleneck - what if lots of clients

2.10.4 NFS - Sun Network File System

Main idea - uses caching to reduce network load

Cache file blocks, file headers etc in both client and
servers memory.
More recent NFS implementations use a disk cache at the client
as well.

$\includegraphics{figs/nfs-fs}$

Advantage

Advantage: If open/read/write/close can be done locally, no network
traffic

Issues: failure and cache consistency

2.10.4.1 Failures

What if server crashes? Can client wait until server comes back up
and continue as before?

Any data in server memory but not yet on disk can be lost
If there is shared state across RPCs, eg open seek read. What
if server crashes after seek? Then when client does ``read'' it will
fail.
Message re-tries - suppose server crashes after it does ``rm
foo'' but before it sends acknowledgement? Message system will
retry and send it again. How does system know not to delete it
again (someone else may have created new ``foo''). Could use
transactions, but NFS takes more ad hoc approach.

What if client crashes?

Might lose modified data in client cache.

2.10.4.2 NFS Protocol

Write-through caching - when a file is modified, all modified
blocks are sent immediately to the server disk. To the client,
``write'' doesn't return until all bytes are stored on disk.
Stateless Protocol - server keeps no state about client, except
as hints to improve performance
- Each read request gives enough information to do entire operation -
  ReadAt(inumber, position) not Read(openFile).
- When server crashes and restarts, can start processing requests
  immediately, as if nothing had happened.
Operations are idempotent: all requests are ok to repeat
(all requests are done at least once). So if server crashes between
disk I/O and message send, client can resend message and server
does request over again
- Read and write file blocks are easy - just re-read or re-write file
  block, so no side-effects.
- What about ``remove''? NFS just ignores this - does the remove twice
  and returns error if file not found, application breaks if
  inadvertently removes other file.
Failures are transparent to client system.

Is this good idea? What should happen if server crashes? If
application in middle of reading file when server crashes, options:
1. Hang until server returns (next week...)
2. return an error? But networked file service is transparent.
  Application doesn't know network is there. Many Unix Apps ignore
  errors and crash if there is a problem.
NFS has both options - the administrator can select which one.
Usually hang and return error if absolutely necessary.

2.10.4.3 Cache consistency

What if multiple clients sharing same files? Easy if they are both
reading - each gets a local copy in their cache.
What if one writing? How do updates happen?
Note NFS has write-through cache policy. If one client modifies file,
writes through to server.
How does other client find out about change?

2.10.4.4 NFS and weak consistency

In NFS, client polls server periodically to check if file has changed.
Polls server if data hasn't been checked in every 3-30 seconds (Exact
time is tunable parameter)

$\includegraphics{figs/nfs-cache}$
When file changed on one client, server is notified, but other clients
use old version of file till timeout. They then check server and get
new version.
What if multiple clients write to same file? In NFS get either
version or mixed version. Completely arbitrary!

2.10.4.5 Sequential ordering constraints

What should happen? If one CPU changes file, and before it completes,
another CPU reads file?
We want isolation between operations, so read
should get old file if it completes before write starts, new version if it
starts after write completes. Either all new or all old any other
way cf serialisability.

2.10.4.6 NFS Pros and Cons

its simple
its highly portable
but its sometimes inconsistent
but it doesn't scale to large numbers of clients

Note that this describes NFS version 3.

2.10.5 Andrew File System

AFS (CMU late 80s)
$\Rightarrow$ DCE DFS (commercial product)

Callbacks: Server records who has copies of file
Write through on close
- If file changes, server is updated (on close)
- Server then immediately tells all those with old copy

2.10.5.1 AFS Session Semantics

Session semantics - updates only visible on close

In Unix (single machine), updates visible immediately to other
programs who have file open.
In AFS, everyone who has file sees old version, anyone who opens file
again will see new version.

In AFS:
1. on open and cache miss: get file from server, set up callback
2. on write close: send copy to server; tells all clients with
  copies to fetch new version from server on next open
Files cached on local disk; NFS (used to) cache only in memory

$\includegraphics{figs/AFS-cache}$

What if server crashes? Lose all your callback state.
Reconstruct information from client - go ask everyone ``who has which
files cached''

2.10.5.2 AFS Pros and Cons

Disk as cache
$\Rightarrow$ more files cached locally
Callbacks
$\Rightarrow$ server not involved if file is
read-only (Majority of file access is read-only)
But on fast LANs, local disk is slower than remote memory

NFS version 4 will provide session semantics when it is
deployed by vendors in the 2005 timeframe.

2.10.6 Summary

Remote file performance needs caching to get decent performance.
Central server is a bottleneck
- Performance bottleneck:
  - All data is written through to server
  - all cache misses go to server
- Availability Bottleneck
  - Server is single point of failure
- Cost bottleneck
  - Server machines high cost relative to workstation

2.12 Content Distribution Networks

Main Points

Building content caches
Pre-fetching data
Using your neighbours - BitTorrent

2.12.1 Getting Content over a Network

Image cdn-base

Users want to download content from serversas quickly as possible
What structures can we use to improve their experience, and the
impact upon the network?

2.12.2 Web Caches

Large organisations can improve web performance by sticking a
web cache in front of HTTP connections
The cache inspects an incoming HTTP request to see if it can
satisfy the request from locally cached objects.
If yes, then the object is returned, and the last reference time
is updated
If no, then the object is retrieved and copied locally if allowed.
Web objects can be marked as non-cacheable.
Caching is a major part of the HTTP standard

2.12.2.1 Cache Performance

The performance of a web cache is difficult to model, since the
performance is a complex mixture of interaction between TCP, HTTP
and content.
Caches work because of temporal locality, due to popularity of
content, and spatial locality, due to structure of HTML documents
Measurements of web proxies give the following results (based on
JANET caches)
- Request hit rate is about 60%.
- Volume hit rate is about 30%.
- Latency improvement is around a factor of 3 on average

2.12.2.2 Problems with Caching

Not all content is marked as cacheable, eg because the site
wants accurate records of who looks at content.
All hosts behind a cache appear to come from one
address.

Question: Why is this a problem?

2.12.3 Pre-fetching Data

Can we improve performance by pro-actively distributing content
to caches?
Yes...

2.12.3.1 Active Content Distribution

The html uses links to the cdn domain name eg akamai.com
The internet service provider has entries in their local DNS for
akamai.com pointing to a machine on the ISP's network.
Content will therefore be supplied from the local machine rather
than the original machine
Customer interaction improved.
Bandwidth requirements of servers reduced

2.12.4 Using your Peers: BitTorrent

Why not use the other people receiving the content?
BitTorrent downloads from other machines
Basic Idea:
- To download, the host contacts a machine tracking those already
  downloading the torrent.
- Peers are selected at random from the tracker.
- Pieces are selected to download from those available on the
  downloader's peer set until all pieces of the file have been received

2.12.4.1 The .torrent file

To start downloading from a torrent, the consumer must
first locate a .torrent file.
This contains information about the file length, name and
hashing numbers of the file blocks, and the url of a tracker
The file is split into 250 KByte pieces, each having a SHA1 hash
calculated.
A tracker holds the IP addresses of current downloaders

2.12.4.2 Locating peers

After receiving the torrent file, the downloader contacts the
tracker
The tracker inserts the downloader into its list of downloaders,
and returns a random list of other downloaders
This list becomes the downloader's peers
Question What is the shape of the overlay Graph?

2.12.4.3 Choosing Pieces

The downloader will contact its peers to discover what pieces
they have.
It then chooses a piece to download:
The first choice is made randomly, so as to spread load
Subsequent pieces are based on a rarest-first approach to
increase probability all pieces are available.
When a peer has downloaded a new piece which matches the SHA1
hash, it notifies its peers it has a new piece.

2.12.4.4 Choosing Downloaders

A request to upload is accepted if the requester recently
uploaded to it
This provides an incentive to machines to share data
Periodically other machines are tried for upload

2.12.5 Summary

Web caching improves performance by a reasonable factor,
dependent on situation
Pro-active content distribution can reduce latency and improve
bandwidth usage for popular services
BitTorrent can improve bandwidth usage by spreading load across
peers.

2.13 Replication: Availability and Consistency

Motivation for replication
Multicasting updates to a group of replicas
Total Ordering
Causal Ordering
Techniques for ordering protocols
ISIS CBCAST

2.13.1 What is Replication?

Multiple copies of dynamic state stored on multiple machines
eg Copies of files stored on different machines, name servers storing
name address mappings
Caching can be seen as a form of replication.

2.13.1.1 Why is Replication used?

Performance enhancement

Single Server acts as a bottleneck - if we can balance load
amongst multiple servers, get apparent performance gain
If clients are geographically distributed, we can site servers
near clients and reduce communication costs

Availability

If a machine fails, then we can still provide a service
Probability of total failure reduced such as all data being
lost, since data replicated across multiple machines
If probability of failure is pr(fail ) for a given machine
in n machines, then probability of loss of service is
pr(fail )ⁿ and the availability of the service is
1 - pr(fail )ⁿ
eg, if mean time between failure for 3 machines is 5 days,
repair time is four hours, then assuming independence of failure,

pr(fail )= ${\frac{{4}}{{5 \times 24}}}$ = 0.03.

Availability =
1 - 0.03³ = 99.996%

Fault ToleranceEven in the presence of failure, the service
will continue to give the correct service

Stronger than availability, since can provide real-time
guarantees (with extra work!)
Can protect against arbitrary failure where machines feed
wrong information (Byzantine Failure)

2.13.2 Issues in Replication

A collection of replicas should behave as if state was stored
at one single site

When accessed by client, view should be consistent
Replication should be transparent - client unaware that servers
are replicated

If we are providing a replica service, replica can be passive or
active.

Passive replicas are standbys, to maintain service on failure. No
performance improvement.

Standbys must monitor and copy state of active server

Provide availability in simple manner.

Used for highly available systems eg space applications

2.13.3 Consistency

Clients can modify resource on any of the replicas.
What happens if another client requests resource before replica
has informed others of modification, as in cache consistency in
distributed file systems?
Answer depends upon application...

2.13.3.1 Example Distributed Bulletin Board System (BBS)

$\includegraphics{figs/bbs}$

Users read and submit articles through Front End.
Articles replicated across a number of servers
Front Ends can connect to any server
Servers propagate articles between themselves so that all
servers hold copies of all articles.
User membership of a given bbs is tightly controlled.

Questions on BBS:

How should messages be passed between replicas?
Should order of presentation of articles to clients be the same
across all replicas? Are weaker ordering semantics possible?
When a client leaves bbs group, can they see articles submitted
after they have left? Is this desireable?
What should happen when replicas are temporarily partitioned?

2.13.4 Updating Server state

Clients read and update state at any of the replicated servers eg
submit messages in bbs. To
maintain consistency, three things are important

Multicast communicationMessages delivered to all servers in
the group replicating data
Ordering of messagesUpdates occur in the same ``order'' at
each server
Failure recoveryWhen servers or the network fails, and comes
back, the replicas must be able to regain consistency. Done through
Voting and Transactions (later in course)

2.13.5 Multicast and Process Groups

A Process Group: a collection of processes that co-operate
towards a common goal.

Multicast communication: One message is sent to the members of a
process group

Idea: Instead of knowing address of process, just need to know an
address representing the service. Lower levels take care of routing
messages.

Useful for:

Replicated ServicesOne update message goes to all replicas,
which perform identical operations. Reduces communication costs.
Locating objects in distributed servicesRequest for object
goes to all processes implementing service, but only process holding
object replies.

2.13.5.1 Group Services

Maintenance of group information is a complex function of the name
service (for tightly managed groups)

Create GroupCreate a group identifier that is globally
unique.
Join GroupJoin a group. Requires joining process information
to be disseminated to message routing function. May require
authentication and notification of existing members.
Leave GroupRemove a process from a group. May require
authentication, may occur as a result of failure or partition. Need
to notify message routing function, may notify other members.
Member ListSupply the list of processes within a group.
Needed for reliable message delivery, may require authentication.

2.13.6 Message Ordering

If two processes multicast to a group, the messages may be arbitrarily
ordered at any member of the group.

Process P1 multicasts message a to a group comprising processes P1, P2, P3 and
P4.

Process P2 multicasts message b to the same group

The order of arrival of a and b at members of the group can be different.

$\includegraphics{figs/order}$

2.13.6.1 Ordering example

$\includegraphics{figs/causal-object}$

Order of operations may be important - delete object, create
object.
If delete object arrives before create object, then operation
not completed

2.13.6.2 Ordering Definitions

Various definitions of order with increasing complexity in
multicasting protocol

FIFO OrderingMessages from one process are processed at all
group members in same order
Causal OrderingAll events which preceded the message
transmission at a process precede message reception at other
processes. Events are message receptions and transmissions.
Total OrderingMessages are processed at each group member in
the same order.
Sync OrderingFor a sync ordered message, either an event
occured before message reception at all processes, or after
message. Other events may be causally or totally ordered.

2.13.6.3 FIFO ordering

Achieved by process adding a sequence number to each message.

Group member orders incoming messages with respect to sequence number.

Applicable when each process state is separate, or operations don't
modify state, just add incremental updates or read.

2.13.6.4 Total Ordering

When several messages are sent to a group, all members of the group
receive the messages in the same order.

Two techniques for implementation:

Sequencer
Elect a special sequencing node. All messages are
sent to sequencer, who then sends messages onto replicas. FIFO
ordering from sequencer guarantees total ordering. Suffers from
single point of failure (recoverable by election) and bottleneck.

Holdback Queue
Received messages are not passed to the
application immediately, but are held in a holdback queue
until the ordering constraints are met.

2.13.6.4.1 Sequence Number Negotiation

Sender negotiates a largest
sequence number with all replicas.

Replicas store largest final sequence number yet seen F_max,
and largest proposed sequence number P_max
Sender sends all replicas message with temporary ID.
Each Replica i replies with suggested sequence number of
max(F_max, P_max) + 1. Suggested sequence number
provisionally assigned to message and message placed in holdback
queue (ordered with smallest sequence number at front)
Sending site chooses largest sequence number and notifies
replicas of final sequence number. Replicas replace provisional
sequence number with final sequence number.
When item at front of queue has an agreed final sequence number,
deliver the message.

2.13.6.5 Causal Ordering

``Cause'' means ``since we don't know application, messages might have
causal ordering''

a and b are events, generally sending and receiving of
messages.

We define the causal relation,
a $\rightarrow$ b, if

if a and b are events at the same process,

a $\rightarrow$ b implies a happened before b
if a is a message sent by process P1 and b is
the arrival of the same message at P2, then
a $\rightarrow$ b is true

In bulletin board, an article titled ``re: Multicast Routing'' in
repsonse to an article called ``Multicast Routing'' should always come
after, even though may be received before the initial article

2.13.6.6 CBCAST - Causal ordering in ISIS

ISIS is a real commercial distributed system, based on process groups.

Causal ordering for multicast within a group is based around
Vector Timestamps

The vector VT has an identifier entry for each member of the
group, typically an integer.

Vector timestamps have one operation defined

merge(u, v)[k] = max(u[k], v[k]), for k = 1..n

Incoming messages are placed on a holdback queue, until all messages
which causally precede the message have been delivered.

2.13.6.6.1 CBCAST Implementation

All processes p_i initialise the vector to zero
When p_i multicasts a new message, it first increments VT_i[i] by 1; it piggybacks
vt = VT_i on the message
Messages are delivered to the application in process P_j when
- The message is the next in sequence from p_i i.e.
  vt[i] = VT_j[i] + 1
- All causally prior messages that have been delivered to p_i must have been delivered to p_j, i.e.
  VT_j[k] $\geq$ vt[k] for k $\neq$ i.
When a message bearing a timestamp vt is delivered to p_j, p_j's timestamp is updated as
VT_j = merge(vt, VT_j)

In words

Incoming vector timestamp is compared to current timestamp.
If conditions for delivery to process not met, then message placed on
holdback queue.
When an incoming message is delivered, the timestamp is updated by the
merge.
Examine all messages in the holdback queue to see if they can be
delivered.
CBCAST requires reliable delivery.

2.13.6.6.2 Causal Example

$\includegraphics{figs/causal}$

2.13.6.6.3 Group View Changes

When group membership changes, what set of messages should be
delivered to members of changed group?
What happens to undelivered messages of failed members?
What messages should new member get?
ISIS solves by sending a sync ordered message announcing
that the group view has changed. Messages thus belong to a
particular group view.
Use coordinator to decide which messages belong to which view.

2.13.7 Summary

Replication of services and state increase availability
Replication increases performance
Replication increases Fault tolerance
To maintain consistency, multicast updates to all replicas
Use sequence numbers to maintain FIFO ordering
Use Vector Timestamps to maintain Causal Ordering
Use elected sequencers or identifier negotiation to maintain
total ordering

2.14 Shared Data and Transactions

Stateful Servers
Atomicity
Transactions
ACID
Serial Equivalence

2.14.1 Servers and their state

Servers manage a resource, such as a database or a printer
Attempt to limit problems of distributed access by making server
stateless, such that each request is independent of other
requests.
- Servers can crash in between servicing clients
- Client requests cannot interfere with each other (assuming
  concurrency control in server
But we can't always design stateless servers...

2.14.1.1 Stateful Servers

Some applications better modelled as extended
conversations, eg retrieving a list of records in a
large database better modelled as getting batch of records at a
time.
If application requires state to be consistent across a number
of machines, then each machine must recognise when it can update
internal data. Needs to keep track of state of distributed
conversation
If long duration then, then need to be aware of state.
If other conversations need to go on - eg modify records during
retrieval, how do we allow concurrency?
What happens if machine fails - need to recover.
Should also aim to be fault tolerant

2.14.2 Atomicity

Stateful server have two requirements

Accesses from different clients shouldn't interfere
with each other
Clients should get fast access to the server

2.14.2.1 Definition

We define atomicity as
All or NothingA client's operation on a server's resource
should complete successfully, and the results hold thereafter (yea,
even unto a server crash), or it should fail and the resource should
show no effect of the failed operation
IsolationEach operation should proceed without interference
from other clients' operations - intermediate effects should not be
visible.

2.14.2.2 Example

Mutual ExclusionFor a multi-threaded server, if two or more
threads attempt to modify the same piece of data, then the updates
should have mutual exclusion around the updates to provide
isolation, using semaphores or monitors
SynchronisationIn situations such as Producer Consumer, need
to allow one operation to finish so second operation can use
results, needing isolation.

2.14.3 Automatic Teller Machines and Bank accounts

$\includegraphics{figs/cashmachine}$

An ATM or cashmachine allows transfer of funds between accounts.
Accounts are held at various machines belonging to different
banks
Accounts offer the following operations
depositPlace an amount of money in an account

withdrawTake an amount of money from an account

balanceGet the current value in an account
Operations implemented as read() and write() of values, so
withdraw x from A and deposit x in B implemented as
1. A.write( A.read() - x)
2. B.write( B.read() + x)

2.14.4 Transactions

Transactions are technique for grouping operations on data so that
either all complete or none complete

Typically server offers transaction service, such as:

beginTransaction(transId)Record the start of a transaction and
associate operations with this transId with this transaction.
commitTransaction(transId)Commit all the changes the
operations in this transaction have made to permanent storage.
abortTransaction(transId)Abort all the changes the transaction
operations have done, and roll back to previous state.

2.14.4.1 ACID

Transactions are described by the ACID mnemonic

AtomicityEither all or none of the Transaction's operations
are performed. If a transaction is interrupted by failure, then
partial changes are undone
ConsistencySystem moves from one self-consistent state to another
IsolationAn incomplete transaction never reveals partial state
or changes before commiting
DurabilityAfter committing, the system never loses the results
of the transaction, independent of any subsequent failure

2.14.4.2 Concurrency Problems

2.14.4.2.1 Lost Update

Transaction T		Transaction U
A.withdraw(4,T)		C.withdraw(3,U)
B.deposit(4,T)		B.deposit(3,U)
balance = A.read()	£100
A.write(balance - 4)	£96
		balance = C.read()	£300
		C.write(balance - 3)	£297
balance = B.read()	£200
		balance = B.read()	£200
		B.write(balance + 3)	£203
B.write(balance + 4)	£204

2.14.4.2.2 Inconsistent Retrievals

Transaction T		Transaction U
A.withdraw(100,T)		Bank.total(U)
B.deposit(100,T)
balance = A.read()	£200
A.write(balance - 100)	£100
		balance = A.read()	£100
		balance = B.read()	£300
		+ balance
		balance = C.read()	£300+
		+ balance
balance = B.read()	£200
B.write(balance + 100)	£300

2.14.5 Serial Equivalence

Definition: Two transactions are serial if all the operations in
one transaction precede the operations in the other.

eg the following actions are serial

R_i(x)W_i(x)R_i(y)R_j(x)W_j(y)

Definition: Two operations are in conflict if:

At least one is a write
They both act on the same data
They are issued by different transactions

R_i(x)R_j(x)W_i(x)W_j(y)R_i(y) has
R_j(x)W_i(x) in conflict

Definition: Two schedules are computationally equivalent if:

The same operations are involved (possibly reordered)
For every pair of operations in conflict, the same operation
appears first in each schedule

So, a schedule is serialisable if the schedule is computationally
equivalent to a serial schedule.

Question: Is the following schedule for these two transaction serially
equivalent?

R_i(x)R_i(y)R_j(y)W_j(y)R_i(x)W_j(x)W_i(y)

2.14.5.1 Transaction Nesting

Transactions may themselves be composed of multiple transactions

eg Transfer is a composition of withdraw and
deposit transactions, which are themselves composed of read and
write transactions

Benefits:

Nested transactions can run concurrently with other transactions
at same level in hierarchy
If lower levels abort, may not need to abort whole transaction.
Can instead use other means of recovery.

2.14.6 Summary

Transactions provide technique for managing stateful servers
Need to worry about concurrency control
Need to worry about aspects of distribution
Need to worry about recovery from failure

2.15 Concurrency Control and Transactions

Problem restatement
Locking
Optimistic control
Timestamping

2.15.1 Why concurrency control?

To increase performance, multiple transactions must be able to
carry on work simultaneously...
...but if data is shared, then can lead to problems such as lost
updates and inconsistent retrievals.
So we must ensure schedules of access to data for concurrent
transactions are computationally equivalent to a serial schedule of
the transactions.

2.15.2 Locking

As in operating systems, locks control access for different
clients
Granularity of data locked should be small so as to maximise
concurrency, with trade-off against complexity.
To prevent intermediate leakage, once lock is obtained, it must
be held till transaction commits or aborts

2.15.2.1 Conflict rules

Conflict rules determine rules of lock usage
If operations are not in conflict, then locks can be shared
$\Rightarrow$ read locks are shared
Operations in conflict imply operations should wait on lock
$\Rightarrow$ write waits on read or write lock, read waits on write
lock
Since can't predict other item usage till end of transactions,
locks must be held till transaction commits or aborts.
If operation needs to do another operation on same data then
promotes lock if necessary and possible - operation may
conflict with existing shared lock

2.15.2.2 Rules for strict two phase locking

When operation accesses data item within transaction
1. If item isn't locked, then server locks and proceeds
2. If item is held in a conflicting lock by another transaction,
  transaction must wait till lock released
3. If item is held by non-conflicting lock, lock is shared and
  operation proceeds
4. If item is already locked by same transaction, lock is
  promoted if possible (refer to rule b)
When transaction commits or aborts, locks are released

2.15.2.3 Locking Implementation

Locks generally implemented by a lock manager
lock(transId,DataItem,LockType)Lock the specified item if
possible, else wait according to rules above

unLock(transId)Release all locks held by the transaction
Lock manager generally multi-threaded, requiring internal
synchronisation
Heavyweight implementation

2.15.2.4 Example

Transactions T and U.

T:
R_T(i), W_T(j, 44)
U:
W_U(i, 55)), R_U(j), W_U(j, 66)

Question What are the possible schedules allowed under strict
locking?

Question Are there any schedules computationally equivalent
to a serial schedule which are disallowed?

2.15.2.5 Deadlocks

Locks imply deadlock, under following conditions
1. Limited access (eg mutex or finite buffer)
2. No preemption (if someone has resource can't take it away)
3. Hold and wait. Independent threads must possess some of its needed resources and waiting for the remainder to become free.
4. Circular chain of requests and ownership.
Most common way of protecting against deadlock is through
timeouts. After timeout, lock becomes vulnerable and can be
broken if another transaction attempts to gain lock, leading to
aborted transactions

2.15.2.6 Drawbacks of Locking

Locking is overly restrictive on the degree of concurrency
Deadlocks produce unnecessary aborts
Lock maintenance is an overhead, that may not be required

2.15.3 Optimistic Concurrency Control

Most transactions do not conflict with each other
So proceed without locks, and check on close of transaction that
there were no conflicts
- Analyse conflicts in validation process
- If conflicts could result in non-serialisable schedule, abort
  one or more transactions
- else commit

2.15.3.1 Implementation of Optimistic Concurrency Control

Transaction has following phases

Read phase in which clients read values and acquire tentative
versions of items they wish to update
Validation phase in which operations are checked to see if they
are in conflict with other transactions - complex part. If invalid,
then abort.
If validated, tentative versions are written to permanence, and
transaction can commit (or abort).

2.15.3.2 Validation approaches

Validation based upon conflict rules for serialisability
Validation can be either against completed transactions or
active transactions - backward and forward validation.
Simplify by ensuring only one transaction in validation and
write phase at one time
Trade-off between number of comparisons, and transactions that
must be stored.

2.15.3.2.1 Forward Validation

A transaction in validation is compared against all
transactions that haven't yet committed
Writes may affect ongoing reads
The write set of the validating transaction is compared
against the read sets of all other active transactions.
If the sets conflict, then either abort validating
transaction, delay validation till conflicting transaction
completes, or abort conflicting transaction.

2.15.3.2.2 Backward validation

Writes of current transaction can't affect previous
transaction reads, so we only worry about reads with overlapping
transactions that have committed.
If current read sets conflict with already validated
overlapping transactions write sets, then abort validating
transaction

2.15.4 Timestamping

Operates on tentative versions of data

Each Transaction receives global unique timestamp on initiation
Every object, x, in the system or database carries the maximum (ie
youngest) timestamp of last transaction to read RTM(x)^2.3 and maximum of last transaction to write
WTM(x)^2.4
If transaction requests operation that conflicts with younger
transaction, older transaction restarted with new timestamp.
Transactions committed in order of timestamps, so a transaction
may have to wait for earlier transaction to commit or abort before
committing.
Since tentative version is only written when transaction is
committed, read operations may have to wait until the last
transaction to write has committed.

An operation in transaction T_i with start time TS_i
is valid if:

The operation is a read operation and the object was last
written by an older transaction ie
TS_i > WTM(x). If read permissible,
RTM(x) = MAX(TS_i, RTM(x))
The operation is a write operation and the object was
last read and written by older transactions ie

TS_i > RTM(x) and
TS_i > WTM(x). If permissible,
WTM(x) = TS_i

2.15.5 Summary

Locks are commonest ways of providing consistent concurrency
Optimistic concurrency control and timestamping used in some
systems
But, consistency in concurrency is application dependent - for
shared editors, people may prefer to trade speed of execution
against possibilities of conflict resolution. Problems can occur
with long term network partition. Approaches based on notification
and people resolution becoming popular.

2.16 Distributed Transactions

Models for distributed transactions
Attaining distributed commitment
Distributed Concurrency Control

2.16.1 Single Server Transactions

$\includegraphics{figs/single-server-trans}$

Till now, transactions have referred to multiple clients, single
server.
How do we have multiple clients interacting with multiple
servers? eg complicated funds transfer involving different accounts
from different banks?
Generalise transactions to distributed case...

2.16.2 Distributed Transactions

2.16.2.1 Distributed Transaction Requirements

General characteristics of distributed systems

Independent Failure Modes
No global time
Inconsistent State

Need to consider:

how to achieve distributed commitment (or abort)
how to achieve distributed concurrency control

2.16.2.2 Models

$\includegraphics{figs/transaction-models}$

If client runs transactions, then each transaction must complete
before proceeding to next
If transactions are nested, then transactions at same level can
run in parallel
Client uses a single server to act as coordinator for all
other transactions. The coordinator handles all communication with
other servers

Question: What are the requirements of transaction ids?

2.16.3 Atomic Commit Protocols

Distribution implies independent failure modes, ie machine can
fail at any time, and others may not discover.
If one phase commit, client requests commit, but one of
the server may have failed - no way of ensuring durability
Instead, commit in 2 phases, thus allowing server to request abort.

2.16.3.1 2 Phase Commit

One coordinator responsible for initiating protocol.
Other entities in protocol called participants.
If coordinator or participant unable to commit, all parts of
transaction are aborted.
Two phases
Phase 1Reach a common decision

Phase 2Implement that decision at all sites

2.16.3.1.1 2 Phase Commit Details

Phase 1 The coordinator sends a Can Commit? message to all
participants in transaction.
Participants reply with vote yes or no. If vote
is no participant aborts immediately.
Phase 2 Coordinator collects votes including own:
1. If all votes are yes, coordinator commits transaction
  and sends DoCommit to all participants.
2. Otherwise transaction is aborted, and coordinator sends abortTransaction to all participants.
When a participant recieves DoCommit, it commits its
part of the transaction and confirms using HaveCommited

2.16.3.1.2 2 Phase Commit Diagram

$\includegraphics{figs/2phasecommit}$

Note:

If participant crashes after having voted to commit, it can ask
coordinator about results of vote.
Timeouts are used when messages are expected.
Introduces new state in transaction Prepared to commit.

2.16.4 Distributed Concurrency Control

2.16.4.1 Locking

Locking is done per item, not per client.
No problems generalising to multiple servers...
...except in dealing with distributed deadlock
Same techniques as usual, but interesting dealing with
distributed deadlock detection.

2.16.4.2 Optimistic Concurrency Control

Need to worry about distributed validation
Simple model of validation had only one transaction being
validated at a time - can lead to deadlock if different cordinating servers
attempt to validate different transaction.
Also need to validate in correct serialisable order.
One solution is to globaly only allow one transaction to
validate at a time.
Other solutions is to validate in two phases with timestamp
allocation - local, then global to enforce ordering.

2.16.4.3 Timestamping

If clocks are approximately synchronised, then timestamps can be

< localtimestamp, coordinatingserverid >

pairs, and an ordering
defined upon server ids.

2.16.5 Summary

Nested Transactions are best model for distributed transactions
Two Phase Commit protocol suitable for almost all case
Distributed Concurrency control is only slightly more diffcult
than for single server case

2.17 Transactions: Coping with Failure

Failure Modes
Recovery Techniques
Partitions and quorum voting

2.17.1 Failure Modes

For Transactions to be atomic and durable, need to examine
failures

Transaction-local failures, detected by the application which
calls abort eg insufficient funds. No info loss, need
to undo changes made.
Transaction-local failures , not detected by application, but by
system as whole, eg divide by zero. System calls abort.
System failures affecting transactions in progress but not media
eg CPU failure. Loss of volatile store and possibly all
transactions in progress. On recovery, special recovery manager
undoes effects of all transactions in progress at failure.
Media failures affecting database eg head crash. No way of
protecting against this.

2.17.2 Recovery

We assume a machine crashes, but then is fixed and returns to operation^2.5.
We need to recover state to ensure that the guarantees of the
transactional systems are kept.
Use a recovery file or log that is kept on permanent storage.

2.17.2.1 The Recovery Manager

Recovery from failure handled by entity called Recovery Manager.
Keeps information about changes to the resource in a recovery file (also called Log)
kept in stable storage - ie something that will survive failure.
When coming back up after failure, recovery manager looks through
recovery file and undoes changes (or redoes changes) so as uncommitted
transactions didn't happen, and committed transactions happened.
Events recorded on Recovery file for each change to
an object in database.

2.17.2.2 Recovery File

Information recorded per event include:
Transaction IdTo associate change with a transaction
Record IdThe identifier of the object
Action typeCreate/Delete/Update etc
Old ValueTo enable changes to be undone
New ValueTo enable changes to be redone

Also log beginTransaction, prepareToCommit, commit, and abort actions,
with their associated transaction id.

2.17.2.3 Recovering

If after failure,

the database is undamaged, undo all changes made by
transactions executing at time of failure
the database is damaged, then restore database from archive and
redo all changes from committed transactions since archive date.

The Recovery file entry is made and committed to stable storage before
the change is made - incomplete transactions can be undone, committed
transactions redone.

What might happen if database changed before recovery file written?

Note that recovery files have information needed to undo transactions.

2.17.2.4 Checkpointing

Calculation of which transaction to undo and redo on large logs can be
slow.

Recovery files can get too large

Instead, augment recovery file with checkpoint

Force recovery file to stable storage
Write checkpoint record to stable store with
1. A list of currently active transactions
2. for each transaction a pointer to the first record in
  recovery file for that transaction
Force database to disk
Write address of checkpoint record to restart location atomically

2.17.2.5 Recovering with checkpoints

To recover, have undo and redo lists. Add all active transactions at
last checkpoint to undo list

Forwards from checkpoint to end,
1. If find beginTransaction add to undo list
2. If find commit record add to redo list
3. If find abort record remove from undo list
backwards from end to first record in checkpointed transactions,
execute undo for all transaction operations on undo list
Forwards from checkpont to end, redo operations for transactions
on redo list

At checkpoint can discard all recovery file to first logged record in
checkpointed transactions

2.17.2.6 Recovery of the Two Phase Commit Protocol

$\includegraphics{figs/2phasecommit}$

Coordinator uses prepared to signal starting protocol,
commit on signalling DoCommit and done to indicate end
of protocol in recovery file.
Participant uses uncertain to indicate that it has
replied yes to commit request, and commited when it
receives DoCommit.
On recovery, coordinator aborts transactions which reach
prepared, and resends DoCommit when in commit state
but not done
Participant requests decision from coordinator if in
uncertain state, but not commited.

2.17.3 Network Partition

Transactions are often used to keep replicas consistent.

If network partitions (cable breaks), replicas divided into two or more sets
(possibly with common members).

$\includegraphics{figs/partition}$

Can we still write and read from any of the sets?

Yes, but

Must reduce possible read and write sets to maintain consistency
Or relax consistency requirements and resolve problems when
partition is healed

2.17.3.1 Quorum Consensus

Consider set of replicas, where replicated objects have version
numbers at each replica.

Assign a weighting of votes to each replica, indicating importance.
For client to perform operation, it must gather votes for all the
replicas it can talk to (denote X).
X $\ge$ votes set for read quorum R to enable read
X $\ge$ votes set for write quorum W to enable write.
As long as
- W > half the total number of votes
- R + W > total number of votes in group
Each Read quorum and each write quorum will have at least one member
in common.

2.17.3.2 Partition Example

For three replicas, R1, R2, R3, we can allocate votes to give
different properties depending upon requirements

Replica	config 1	config 2	config 3
R1	1	2	1
R2	0	1	1
R3	0	1	1

What should R and W be set to in the three
configurations?
What properties result from these configurations?

2.17.3.3 Read and Write Operations

When write happens, object has a version number incremented

To read, collect votes from replica managers with version numbers of
object. Guaranteed to have at least one up to date copy if in read
quorum, from which read occurs.

To write, collect votes from replica managers with version numbers of
object. If write quorum with up to date copy not discovered, then
copy up to date copy around to create write quorum. Then write is
allowed.

Manipulating R and W give different characteristics eg R = 1 and W = number of copies gives unaminous update.

Cached copies of objects can be incorporated as weak
representatives with 0 votes, but usable for reads.

2.17.4 Summary

Atomicity comes from using logging techniques on operations at
server, where log is kept on stable storage
Voting can be used to give availability for resources on
partitioned replicas.

3. Exercises and answers

订阅：博文 (Atom)

2009年5月25日星期一

2009年5月12日星期二

2009年4月28日星期二

文档的编写：

程序开发

2009年4月22日星期三

问题的来源

分布式存储系统的分类

主要存储系统简介

reading notes for <<Above the Clouds: A Berkeley View of Cloud Computing>>

2009年4月20日星期一

VGA是鸡肋？14款电视PC连接横向评测!

2009年4月13日星期一

Download and install OpenXCAP

2009年4月10日星期五

2009年4月1日星期三

source: http://www.cogs.susx.ac.uk/courses/dist-sys/node1.html

2.1.2 What's a Distributed System

2.1.3.1 Network assumptions

2.2.3 Memory and Packets

2.2.4 Bit Manipulation

2.3.1 Physical Concepts

2.3.2 Packets

2.3.2.1 Addressing

2.3.2.3 Indirect Addressing

2.3.2.4 Architecture of a switch

2.3.2.5 Distributed Routing Table Maintenance

2.3.3 Conclusion: Network Properties

2.8.2.1 Secure Digest Functions

2.8.3.1 Symmetric Encryption

2.8.4.1 Public key directory

2.8.5 Secure Socket Layer

2.8.5.1 SSL handshake protocol

2.8.6.2 Digital Rights Management

2.8.8 Trusted Computing Platform

2.12 Content Distribution Networks

2.12.1 Getting Content over a Network

2.12.2 Web Caches

2.12.2.1 Cache Performance

2.12.2.2 Problems with Caching

2.12.3 Pre-fetching Data

2.12.3.1 Active Content Distribution

2.12.4 Using your Peers: BitTorrent

2.12.4.1 The .torrent file

2.12.4.2 Locating peers

2.12.4.3 Choosing Pieces

2.12.4.4 Choosing Downloaders

2.12.5 Summary

2.13 Replication: Availability and Consistency

2.13.6.1 Ordering example

2.13.6.2 Ordering Definitions