听了林学民教授的一个学术报告

Graph-based Structure Search

林学民教授是数据库理论方面的专家。今天的学术报告，林教授给我们简单介绍了他的研究团队在海量数据检索方面的研究成果。

在快带发展的互联网社会，越来越多的应用需要检索快速地检索基于图结构的海量数据。比如近两三年兴起的，我们非常熟悉的社交网络。对于图的结构化检索是一个NPC问题。海量的数据规模使得问题变得非常的棘手。我搜索了一些资料也没搞清楚什么叫NPC问题。我简单地理解，是当问题的规模变得越来越大时，就不能确定是否能在确定的时间内求出解。

林教授先简单地提到对图的研究方向大致分Mining Graphs,和Quering Graphs。Mining Graphs方面有Frequent Patterns, structure Similarityies, clustering, Ranking, classification。Prof. Lin的团队主要做了Structure Similarities方面的一些研究。对于Querying Graphs，又大致分为Readibility, Connectivity between two nodes, substructure search, shortest paths, strings。Prof. Lin话：对于shortest path，理论上求解很simple，但是应用到大量数据的时候，就要做很多工作。比如要做非常高效的索引。对于strings，一个很长的text可以看成一个长string。比于现在的internet搜索引擎，要解决区分互联网上太多冗余数据的问题，就属于这方面的研究。我知道搜索引擎是鼓励原创的，会给原创的页面比较高的page rank。针对这点，black hat SEO就会搞伪原创，怎样让搜索引擎快速区分出真正的原创内容呢？应该也属于这方面的课题吧。

林教授开始讲他的科研团队的第一个贡献。这个contribution是解决了关于图与图之间包含的问题。问题描述为：给出一个Data Graph和一个Query Graph，要在一个规模非常大的Data Graph里面找到这个Query Graph。林教授介绍了这个过程中的Filtering和Verification算法简要思路。对于不懂学术的我，这部分真是相当地难以理解啊！！！设D是要处理的图。First index a set F of features from D。对于任意f属于F, 求f的超图。索引的算法思路是。如果q是一个frequent subgraph,则 1. 如果q已经被索引，则返回q的超图。2.如果q是一个超图，次q被索引了，不需要给次q的超图做verification。如果q不是一个frequent subgraph的情况下，验证受制于sigmal|D|。林教授认为这个求解过程的主要开销在verification阶段。他们的团队就是把这个cost显著降了一下，而且思路非常地clear和simple。他总结了三条经验： 1. Access infrequent labels as early as possible. 2. Retain connectivity。 3. Effectively use degree information.

The second contribution of Prof. Lin is to put the the theoretic algorithm into practice.有两个问题，一个是query graph被data graph包含(contained)，另一个是query graph包含了data graph.林教授的团队的最结成果比已经发表了论文的另一个科研团队的成果的速度要快了一个数量级。Prof. Lin的语气显得有些遗憾，他们的思路是正确的，实现的效率也很高，但是时间上慢了小小。

http://ykyi.net zausiu's blog

发言队段，研究这个方向的甘同学问了非常专业的问题，我不懂啊！还有一位信科院的研一同学问到了留学的问题，Prof. Lin说他要最top的学生，非常地talented，非常地diligent, 非常地good at algorithm，还要有非常好的practical ability。朝红阳教授向林教授打听有没有做图象search方面的大牛。

学术报告好难啊！我想听工程类的报告。但都是academic report啊！

//////////////////

Topic:

Graph-based Structure Search

Speaker:

Prof. Xuemin Lin

School of Computer Science and Enmgineering,University of New South Wales

and

School of Software ,Eastern China Normal University

Abstract:

Recent years have witnessed an emergence of graph-based applications that strongly demand efficient processing of graph-based queries and searches.

In many real applications such as chem- and bio-informatics, road neworks, social networks, etc., efficiently and effectively conducting structure search on graphs is a key.

The problem of structure search on graphs and its variants are NP-complete. The presence of a large number of graphs makes the problem computationally more intractable.

In this talk, I will introduce recent techniques in graph structure search with the focus on our recent work published in SIGMOD, VLDB, ICDE, etc. My talk will cover the problems of substructure search, superstructure search, and substructure similarity search.

Brief Bio:

林学民，新南威尔士大学大学计算机科学及工程学院教授，数据库研究实验室主任。自2009年4月起，在华东师范大学担任兼职教授，2010年入选国家第四批千人计划学者，2011年6月起受聘于华东师范大学软件学院千人学者。生于上海市，1984年毕业于复旦大学数学系，1992年在昆士兰大学获得博士学位。长期从事数据库理论、算法与技术研究，包括空间数据、不确定数据、流数据、图数据等领域。近年来在数据库领域顶级国际学术会议及顶级杂志上共发表了70余篇学术论文；累计在数据库和算法领域重要的国际学术会议和国际学术期刊发表并录用了170余篇论文，其中10篇国际会议优秀论文，包括ICDE'07上的优秀学生论文奖 (best student paper award) ICDE'10上的优秀论文(one of the best papers) , 及SIGMOD'11 上的优秀论文(one of the best papers)。他目前的研究兴趣是海量数据的处理，包括图数据、时空数据、字符串数据、不确定数据等。学民先后在4th Australasian Theory Symposium (CATS'98), 6th International Computing and Combintorics Conference (COCOON2000), 6th Asia Pacific Web Conference (APweb04), the joint conference of 9th Asia Web Conference and 8th Web-Age Information Management Conference (APweb/WAIM2007), 和 19th and 20th Australasian Database Conference (ADC08, ADC09) 担任会议程序委员会主席，在14th International Conference on Database Systems for Advanced Applications (DASFAA09) 上担任大会主席。目前他是ACM TODS的编委。

Presided by

Prof. Jianlin Feng

Date and Venue

Dec 21, 2011(Wed.) 14：00-15：00

Lecture Theater A101, School of Software, Sun Yat-sen University

冬大过年！无痛终结2011年.展望2012

写在冬至日，总结2011展望2012。冬大过年是广东的俗语。无痛终结是因为人心态早已看得化(开)。

首先：在无比苦逼的11年里，最大的贡献就是返校读书了。基本上确定可以了结本科毕业时被取消学位的心愿。

其次工作上没有什么好说的。在东莞光大集团工作了半年，不能说辛苦也不算闲。工作上没学到什么东东，没明显地进步。在此谢谢赵经理了，如果我有赵经理的硬件水平我就笑哈哈哈哈。以识了几位有意思的同事。

感情上则一如既往地Fail, Fail, Fail 。。。完全木有信心了。哥爱无力了。到了年尾心态竟变得更无所谓，该怎样就怎样吧。只是让家人操心了，对不起啊，尽力了啊。奶奶，对不起啊。也想让你开心的。哎！想起过年回家要去那么多亲戚拜年，555，拜托你们不要问不要问吖。

最后讲讲返校后的生活。在中大的生活第一学期课程非常多，非常忙。我还有我的plan要投入很多时间。时间无论如何不够用！同学关系远没本科时紧密。可能我的心态跟刚毕业的后生仔们相差甚远吧。写到这里，想起今年国庆时去了其它地方没有参加强强和敏的婚礼。唉！什么时候去江苏再给我机会补偿好吗。我做人是越来越失败了，完全是不分好坏敌我的混乱！

2012年是世界的最后一年。可恶的地球人，觉悟吧！

以前每年年尾都会许好多愿望。结果第二年能实现的却渺渺无几。反正都2012了，哥的愿望就务实一点好吧。维护世界和平的任务还是不要勉强啦！

明年必须在寒假结束前看完Unix网络编程.

通过六月的BEC高级考试。

全年选择性的读完Essential Linux Device Drivers和Understanding linux Kernel.

上半年看paper，形成写论文的思路。下半年动手写论文。

下半年拿到offer。银行？民航？其它国企？还是互联网公司抑或其它什么公司？…

感情的事务不设定目标。苦逼的技术宅男果断必然无解！

Unix 网络编程第十章 SCTP Client/Server Example

这一章的内容还是满少的。也就是给出一个SCTP的简单例子。所以也没有太多需要做笔记的。

1. 什么是 head-of-line blocking.

Head-of-line blocking occurs when a TCP segment is lost and a subsequent TCP segment arrives out of order. That subsequent segment is held until the first TCP segment is retransmitted and arrives at the receiver.

2. 怎么更改SCTP连接的stream的数量。

SCTP连接的streams的数量是在association的握手之前协商好的。对于FreeBSD的KAME实现，SCTP的outbound streams默认为10。这个值可以用setsocket函数更改。与SCTP_INITMSG scoket option相关，设置struct sctp_initmsg结构体。

也可以用sendmsg函数发送ancillary数据来到达同样的目标。但发送ancillary data只对one-to-many形式的sctp socket有效。

3. 怎么结束一个SCTP连接。

可以设置sctp_sndrcvinfo结构的sinfo_flags值的MSG_EOF flag来关闭一个sctp连接gracefully. This flag forces an association to shut down after the message being sent is acknowledged.

还可以给sinfo_flags设置 MSG_ABORT。这样就会立即发送一个ABORT给peer端。任何还没来得及发送出的数据会被丢弃。

sicily 2014. Dairy Queen 解题报告

Sun Yat-sen University ACM 2014. Dairy Queen

1. 原题中文大意;

给出一个在 [1, 300] 之间的数字N。然后给出一个集合，这个集合中有C 个数字，C在[1, 8]之间。该集合中的每个数字在[1, 200]之间。问如何用这个集合中的数字组合（数字可以重复使用），使它们的总和等于N。

2. 解这种题毫不犹豫地想到用动态规划。

3. 设一个数组 ways[301]; ways[n]表示有多少种方法可以组合成和n。假设集合中只一个数字x。那么问题就非常简单，只能取n个该数字，和的集合为{x, 2x, 3x}，即ways[x]=1，ways[2x]=1, ways[3x]=1 ….。如果集合中有m数字，假设我们已经求得了用这m个数字的组合可以得到的和的数目存到了ways数组里。即对于m个数字，已求得 ways[1] = ?, ways[2] = ?, ways[3] = ? ….. Ways[300] = ?。那么在此基础上，再加多一个数字，即 m+1个数字的情况会是如何呢。设新加入的这个数字的值是p，则应该是 ways[q] += ways[q-p]。（q > p）

4. 程序注释清单.

#include <string.h>

#include <stdlib.h>

#include <stdio.h>



int main(int argc, char* argv[])

{

int N, C, coin;  // N为题意中的N，C即题意中的C. coin作临时变量用，不解释，可见下面的代码。

int ways[301] = { 0 };  // 见第三部分的分析。

ways[0] = 1;  // 没有实际意义。为了方便计算.



scanf("%d %d", &N, &C);

for ( int i = 0; i < C; i++)

{

scanf("%d", &coin);

for ( int j = coin; j <= N; j++ )

{

ways[j] += ways[j-coin];  // 见第3部分的分析.

}

}

printf("%d\n", ways[N]);



return 0;

}

5.时间复杂度，空间复杂度分析。

空间复杂度是确定的常数。时间复杂度是o(n). n为有多少种面值的硬币。

http://ykyi.net zausiu's blog

原题：

Description

Bessie, always in need of an income, has decided to leverage her dairy skills by taking a part-time job at the local Dairy Queen restaurant. She is running a special cash register (since she has hooves instead of fingers and thus requires special accommodation). Her job is making change for customers as they make a purchase.

As she was counting out 83 cents in change, she wondered: "How many ways can I count out 83 cents? I can use three quarters and eight pennies, seven dimes and three pennies, 83 pennies… there must be a zillion ways!"

How many different ways can one make change for N (1 ≤ N ≤ 300) cents using coins from a set of C (1 ≤ C ≤ 8) coins of supplied values C_i (1 ≤ C_i ≤ 200)? "Different" means differing counts of coins.

Thus, 8 cents can be made, in American currency, with 1 five-cent piece + 3 one-cent pieces and also with 8 one-cent pieces. Using 3 one-cent pieces + 1 five-cent piece is the same as 1 five-cent piece + 3 one-cent pieces, so one can create eight cents in just two different ways. Note that some coin systems are poor at making change and result in an answer of '0'.

Coin values in the input file are listed in descending order from largest to smallest. All coin values will be distinct.

Input

Line 1: Two space-separated integers: N and C
Lines 2..C+1: Line i+1 contains a single integer: C_i

Output

Line 1: A single line with a single integer that is the number of ways to create N cents of change using the supplied coins. The answer is guaranteed to fit into a signed 32-bit int

Sample Input

Sample Output

Unix网络编程.第九章笔记

# TCP 协议在1981年标准化，而SCTP协议则是在2000年由IETF组织标准化。IETF： Internet Engineering Task Force 互联网工程任务组成立于1985年。

# SCTP支持两种类型的socket. one to one(为了方便现有的tcp程序迁移到sctp) 和 one to many.

# sctp的sctp_bindx可以绑定多个地址，并且可以动态增删。但动态增删不影响已经存在的associations.不过各个unix分支不一定支持这个特性。

# sctp_getpaddrs和sctp_getladdrs分别是getpeername和gethostname的sctp版。They are designed with the concept of a multihoming aware transport protocol.

# sctp传输的报文是有边界的。sctp在传输层支持传输超大的message.

Is your biometric data safe

Kot教授的这个讲座在网页上写的标题是：IEEE Distinguished Lecture Talk。多么气质不凡的Distinguished啊！！！作为一个小工程硕，实在有受宠若惊之感。

讲座的大概讲的Kot教授提出了如何构建一个新颖的(novel)系统来保证Biometric信息在服务端的安全。即使这部分信息泄露出去，攻击者也不能够利用它。首先什么是BIOMETRIC呢？你的指纹，你的面部，你的头发…都属于Biometric。其实Biometric对于大家都不会陌生，指纹考勤系统实在是相当之普遍啊。而且科幻或者犯罪电影里还有虹膜校验。

Kot教授提出Biometric的验证系统的不安全的方面，如果被泄露出去，Biometric信息不像普通密码一样容易被更换，一但被伪造这样就会比传统的验证方法更加麻烦。Prof. Kot简单地walk through了几个保护Biometric的解决方案，并说明Biometric信息是可以被利用被伪造的。

PPT中多次提到Minutiae的概念，一直没听明白啊！貌似是从原始的Biometric信息中提取出部分信息而作为用来验证的数据。Professor Kot论证常见的基于指纹的验证系统的弊端。论证中非常多的“valley”，"ridge"，比较变换前后的差异。不明白啊，不明白！大概的结论是目前的技术可以做到从Minutiae中reconstruct到原始的original fingerprint。后来想这时候论证的是为了支撑其后Prof. Kot提出的新的novel solution.

Kot提出的解决方案是在采集指纹信息时需要采集用户的两个不同的手指，然后综合这两个手指的信息计算出一个Minutiae。这个Minutiae与从采集单个手指得到的Minutiae几乎没有差别。因为这个Minutiae记载的是两个不同手指的信息，而且每个手指仅仅只记部分信息。这样的话，即使攻击者盗取了这个Minutiae，它就无法原始到原始的fingerprint。按我自己的理解，攻击者应该也不清楚它还原得到的指纹其实并不存在，因为Minutiae是两个指纹的信息合成的而且与由单个指纹得到的几乎一样，攻击者无法分辨。攻击者应该会沾沾自喜reconstruct到了被误导的original finger print。

Kot还指出在他提出的这个解决方案中可以嵌入信息记到Minutiae里面。在后面的提问阶段。有同学问到这个解决方案可以存入多少信息。Kot补充目前可以存一千多个Bit位信息，足够存下重要的身份认证数据了。按我自已的理解，能存多少信息应该是可以根据需要调大和缩小的呀。比如把图形转成矢量图形后，就可以不失真的放大，于是就会有更多存储空间了。不确定是不是这样。

提问阶段Kot指出这个设计没有使用token或者key因此会运行的非常迅速。而如果采用复杂的加解密算法，则会有非常大的计算开销。运行速度对于一些嵌入式的电子系统是至关重要的。

有同学提问提到做Face Recognition，问Kot教授有什么看法。Kot认为Face Recognition is very difficult, very very difficult, few advancement has been achieved nowadays. It's a very tough problem.

At last, as a conclusion for this report, I must show my great respect to Alex Kot. He is a great scientist.

/////////////////

IEEE Distinguished Lecture Talk

Title：Is your biometric data safe?

Speaker：Prof. Alex C Kot, 新加坡南洋理工大学教授、IEEE Fellow

时间： 12月16日（本周五）下午3：00

地点： 信息科技学院大楼二楼207讲学厅

主持人： 黄继武 教授

主办单位：信息科学与技术学院、IEEE Signal Processing Society Guangzhou Chapter, IEEE Circuits and Systems Society Guangzhou Chapter

Abstract

Nowadays, biometrics is widely used in authentication systems. In general, biometrics needs to be stored in a database for subsequent authentication. However, templates stored in the database are at the risk of being stolen or modified. Once the template is stolen, it is difficult to be replaced like passwords and the private user information associated with the stolen template would also be exposed. Thus, biometrics templates should be stored in the database such that both the security of the template and the privacy of the user are not compromised under various attacks.

We first propose a fingerprint authentication system for the privacy protection of the fingerprint template stored in a database. The considered fingerprint data is a binary thinned fingerprint image, which will be embedded with some private user information without causing obvious abnormality in the enrollment phase. In the authentication phase, these hidden user data can be extracted from the stored template for verifying the authenticity of the person who provides the query fingerprint. A novel data hiding scheme is proposed for the thinned fingerprint template. Compared with using existing binary image data hiding techniques, the proposed method causes the least abnormality for a thinned fingerprint without compromising the performance of the fingerprint identification.

The minutiae is another type of data stored in a fingerprint template. We investigate to what extreme a reconstructed fingerprint can be similar to the original fingerprint. A new scheme is proposed to reconstruct a full fingerprint image from the minutiae points. Experimental results show that the successful match rate between our reconstructed fingerprint and the original fingerprint is over 99% at FAR=10^-4. When matched against the different impressions of the original fingerprint, our reconstructed fingerprint has over 86% successful match rate at FAR=10^-4. We consider the privacy issues of the fingerprint reconstruction. The analysis shows that our proposed technique is useful for protecting the fingerprint ridge frequency.

As a reconstructed fingerprint can be so similar to the original fingerprint, it is also very important to protect the privacy of the minutiae template stored in a database. We propose a novel system for protecting the privacy of the fingerprint minutiae without using a token or key. In the enrollment, two fingerprints are captured from two different fingers of the same person. A combined minutiae template containing only a partial minutiae feature of each of the two fingerprints will be generated and stored in a database. In the authentication, the user needs to provide two query fingerprints from the same two fingers which are used in the enrollment. By storing the combined minutiae template, the complete minutiae feature of a single fingerprint will not be compromised when the database is stolen. Furthermore, because of the similarity in topology, it is also difficult for the attacker to distinguish our template from the minutiae of an original fingerprint. The experimental results show that our system can achieve a very low error rate.

Biography

Dr Kot has been with the Nanyang Technological University, Singapore since 1991. He headed the Division of Information Engineering at the School of Electrical and Electronic Engineering for eight years until 2005. The Division’s research focuses are on signal processing for image, video, speech and audio. He started serving as Vice-Dean (Research) for the School of EEE in 2005 and became Associate Dean for the College of Engineering in 2008. He is currently a Professor at the School of EEE and Associate Dean for the College of Engineering. He has published extensively with over 200 technical papers and 3 patents in the areas of signal processing for communication, biometrics recognition, data-hiding, authentication and media forensics.

Dr. Kot served as Associate Editor for the IEEE Trans. on Signal Processing, IEEE Trans. on Multimedia, IEEE Trans. on Circuits and Systems for Video Technology; and IEEE Trans. on Circuits and Systems Part II as well as Part I. He also served as Guest Editor for the Special Issues for the IEEE Trans/ on CSVT and JASP. He is currently Associate Editor for the IEEE Trans. on Information, Forensics and Security, IEEE Trans. on Image Processing and IEEE Signal Processing Letter. He is also Editor for the EURASIP Journal of Advanced Signal Processing, the IEEE Signal Processing Magazine and the IEEE Journal of the Special Topics in Signal Processing. He is an Editorial Board member of the Journal of Fundamental and Theory in Signal Processing. He serves in the IEEE CAS Visual Signal Processing and Communication, the IEEE SPS Image and Video Multi-dimensional Signal Processing and IEEE SPS Information Forensic and Security technical committees. He has served the IEEE Society in various capacities such as the General Co-Chair for the 2004 IEEE International Conference on Image Processing (ICIP) and Chair of the worldwide SPS Chapter Chairs and the Distinguished Lecturer program. He serves as IEEE Fellow Evaluation Committee. He received the Best Teacher of the Year Award and is a co-author for several Best Paper Awards including ICPR, WIFS and IWDW. He was the IEEE Distinguished Lecturer in 2005 and 2006 and is a Fellow of IEEE and IES.

Distributed Framework for Iterative Computations on Massive Datasets

I must confess I've only got a few words and a vague skeleton of what the Professer Lixin Gao said due to the language barrier, during the whole speech Ms Gao spoke English, and the topic is so pedantic which is full of the complicated mathematic formulation.Personally，我未有涉及过海量数据并行计算以至云计算相关的任何技术，基本上对于海量数据的所有技术都让我觉得非常地cutting-edge。带着学习开拓眼界拓展知识面的良好愿望参加了Ms. Gao的学术讲座。Ms. Gao讲PPT时全是英文啊。我努力地听，努力地听！！！适当地记些笔记。其实还是很喜欢听英语的。

讲座先列举了一些大型互联网公司的数据显示集群计算的规模。比如谷歌在数据中心有一百万台服务器！！！Holy gosh，1,000,000 servers！！！

然后谈到了很多大规模数据的计算对时间的要求非常地严格。于是谈到了目前流行的算法：MapReduce。MapReduce是google提倡的一种编程模型，用于大规模数据集（大于1TB）的并行运算。“Map”和 “Reduce”，和他们的主要思想，都是从函数式编程语言里继承来的，还有从矢量编程语言里借来的特性。他极大地方便了编程人员在不会分布式并行编程的情况下，将自己的程序运行在分布式系统上。我自己的理解是：map索引整个海量数据，生成key-value pair which is used to facilitate the quick query and computing。而reduce则是用来分析和统筹基于键值对的中间计算结果，使中间结果迅速向最终的正确结果收敛。

Ms. Gao 提出MapReduce虽然是工业界非常popular的解决方案，但是不可以以迭代的方式计算。而Ms. Gao所做的工作即是使MapReduce支持迭代的功能。（A framework that support iterative computing in the cloud.）从事情一般规律的哲学高度看待这个问题，对于规模相当大或者最终结果是相当不明确的问题，似乎都可以用到'迭代'的思想。例如，迭代的软件开发模式则是当今互联网应用的主要开发模式。每一次迭代都使产品向理想中最perfect的状态接近。Ms. Gao用迭代的思想处理海量数据应该也是同样的道理罢！

Ms. Gao的解释说改进后支持迭代的iMapReduce可以使map后的数据支撑reduce,reduce的数据亦转而支撑map，形成一个迭代反复的过程，其中在map阶段加入static data。较之原始的MapReduce，新的算法使得static data的规模显著地减少。Ms. Gao展示了几个图表证明改进后增加了迭代功能的MapReduce(称之为 iMapReduce )的速度优于原始的MapReduce.图表中有展示用iMapReduce计算PageRank的值。PageRank？是不是判断网站权重的那个东东啊？哥的网站http://ykyi.net的PageRank才0啊！！！才0啊~~用什么算法可以算成9啊！

PPT转到计算图的指定两点最短路径的Dijkstra算法。泪流满面，终于有一个我懂的算法了啊！Ms. Gao说先找到两个结点间的最短路径，于是赋于这个最短路径经过的结点最高的权重，其它的路径就赋于相对低的权重。提出了PriorityQueue的数据结构。把权重引入到iMapReduce算法，称之为pMapReduce算法。引入权重后的pMapReduce算法比iMapReduce算法的速度更快。生活中我们做事情也要优先处理最重要的事情，道理是一样的啊！世间万事万物，无论多么复杂，抽象出来的一般规律其实都是相当地一致。易经中说：形而下者谓之器，形而上者谓之道！Oops, The ancient Tao philosophy is full Of Wisdom.

最后阶段，再次展示了一个图表，Ms. Gao完成的建立在Apache的Hadoop上的pMapReduce要明显的优于原始的Hadoop。并且也优先其它多种解决方案，其它的名字我都没听过啊！在提问阶段，尽管温武少老师积极鼓励同学提问但没有同学提出问题，大概大家都不懂吧！朝红阳老师和另一位研究海量数据的老师谈了一点点。他们提到iterative的算法在他们的实际应用中效果并不是太明显，不如memory-based的解决方案。Ms. Gao调出一张图表，解释说某种memory-based的解决方案没有迭代的算法快。

作为众多打酱油的同学中的一员，我最后的心得体会是：学术真是太高深了啊！我还是做工程吧！！！车，我有得拣咩？

/////////////////////////

Topic

Distributed Framework for Iterative Computations on Massive Datasets

Speaker

Prof. Lixin Gao

Abstract

Iterative algorithms are pervasive in many applications such as search engine algorithms,machine learning, and recommendation systems. These applications typically involve a dataset of massive scale. Fast iterative computations of the massive datasets are essential for these applications. This is particular important for on-line query such as keyword based search query. In this talk, we present an overview of MapReduce framework, and propose two frameworks, iMapReduce and pMapReduce, that enable fast iterative computations. By providing the support of iterative computations and prioritized execution, we can ensure faster convergence of the iterative process. Both iMapReduce and pMapReduce preserve the MapReduce distributed computing framework and is particularly efficient for online queries such as top-k queries. We implement iMapReduce and pMapReduce based on Apache Hadoop and evaluate its performance. Our evaluation results show that pMapReduce can reduce the computation time by two orders of magnitude comparing to that achieved with MapReduce. At the end of the talk, I will provide an overview of on-going projects in my research group.

Bio

Lixin Gao is a professor of Electrical and Computer Engineering at the University of Massachusetts at Amherst. She received her Ph.D. degree in computer science from the University of Massachusetts at Amherst in 1996. Her research interests include social networks, and Internet routing, network virtualization and cloud computing. Between May 1999 and January 2000, she was a visiting researcher at AT&T Research Labs and DIMACS. She was an Alfred P. Sloan Fellow between 2003-2005 and received an NSF CAREER Award in 1999. She won the best paper award from IEEE INFOCOM 2010 and her paper in ACM Cloud Computing 2011 was honored with “Paper of Distinction”. She received the Chancellor’s Award for Outstanding Accomplishment in Research and Creative Activity in 2010, and is a fellow of IEEE.

Date and Venue

Dec.16（Friday）13:30pm-14:30pm

Lecture Theater A101, School of Software, Sun Yat-sen University

Unix网络编程.第七章笔记

1. 有一个 Generic Socket Option：SO_BROADCAST 用来开启是否充许广播。2. 仅有TCP支持SO_DEBUG。Kernel把详细的发送包的接收包信息存在一个环路buffer里。可以用trpt查看它。3. SO_ERROR套接字选项只能被取得，不能被设置。如果进程阻塞在select调用里，不论是读还是写，select都会返回conditions set。如果进程使用signal-driven I/O，则SIGIO信号会被发到这个进程或者它所在的进程组。进程用getsockopt得到so_error后，so_error就会被重置为0.如果so_error不为0，此时调用 read 或者 write 就会立即返回-1，errno被置为so_error的值。

4. SO_KEEPALIVE在一个tcp连接没有任何收发达两个小时时会发一个prob给对方。两个钟喔。好长的时间啊！！！Richard Stevens认为SO_KEEPALIVE应该改叫 make-dead。大多Unix的kernel把这个两小时时长存为一个系统级的变量。这意味着如果你用某种方法改变了这个时长，会影响这个操作系统的所有socket.Richard Stevens认为想把这个时长改短的想法是错误理解了这个选项的意义。

5. SO_LINGER仅仅适用于面向连接的socket，比如TCP和SCTP，因此不适用UDP。

Struct linger{

int l_onofff;

int l_linger;

};

L_onoff为零是，close的behavior就是默认情况：close立即返回，kernel继续发送缓冲区内未发送的数据。当l_onoff不为零，而l_linger为0，TCP协议则会立即终止连接，发送RST给peer端，这样做避免了TCP的TIME_WAIT状态，这样的坏处是: leaves open the possibility of another incarnation of this connection being created within 2MSL seconds and having old duplicate segments from the just-terminated connection being incorrectly delivered to the new incarnation.对于SCTP, 也会发送一个ABORT chunk给peer.第三种情况：如果l_onoff为真，l_linger不为0.则close不会立即返回，会最多等待指定的时长，在这段时间里kernel发送缓冲区内的数据给peer。如果在这段时间内发送完则close返回0.如果没有发送完毕，close就会返回EWOULDBLOCK，send buffer里的数据会被丢弃。

6. MSL: maximum segment lifetime.

7. 因此close默认情况下立即返回，即使用SO_LINGER设置等待时间也在理论上存在close先于发送缓冲区的数据被peer端acknowleded的情况。因此一个更好的解决方案是用shutdown系统调用with a second argument of SHUT_WR。在调用shutdown后，用read调用直到read返回0，即收到peer端的FIN。

8. UDP没有congestion control，如果发得太快，不仅仅peer端来不及收。A fast sender can overwhelm its own network interface, causing datagrams to be discarded by the sender itself.

9. 因为tcp在三次握手阶段的SYN segment里交换窗口大小(windows scale)。所以tcp的接收缓冲大小必须在shank hands之前就设置好。Connected socket从listening socket继承这个选项。

10. TCP的套接字缓冲区大小至少需要是4倍MSS的大小。这里指的套接字缓冲，如果是针对单向的传输，比如单向传一个文件，指的是发送方的发送缓冲和接收方的接收缓冲。如果是双向的传输，则指的是双方的接收和发送缓冲。为什么tcp套接字的缓冲区需要至少是4倍MSS的大小呢？这是因为tcp的快恢复(fast recovery)算法。快恢复算法定义：如果接收端连续收到三个同样的ACK，就认为有packet丢失了。而接收端在segment丢失后，每收到新的segment就不停地重发一个重复的ack给sender。如果窗口大小小于4个segments,就不会有三个重复的ACK,所以快恢复算法就没办法工作了。

11. What is bandwidth-delay product.

A: The capacity of the pipe is called the bandwidth-delay product and we calculate this by multiplying the bandwidth(in bits/sec) times the RTT(in seconds), converting the result from bits to bytes. 即网速乘上RTT.

12. UDP没有send buffer，但有一个send buffer size.只要socket的buffer size大于LO_WATER，就永远是可写的。

13. SO_REUSEADDR有四种用途。好难写。见Unix Network Programming Section 7.5 影印版第211页啦。

14. It's Ok for the MSS to be different in each direction.

15. Nagle算法是为了减少small package的数量.The algorithm states that if a given connection has outstanding data, then no small packets will be sent on the connection in response to a user write operation until the existing data is acknowledged. Small package 指的是任何比MSS小的package.

16. Delayed ACK algorithm: This algorithm causes TCP to not send an ACK immediately when it receives data; instead, TCP will wait some small amount of time (typically 50-200ms)and only then send the ACK.

17. SCTP有一个SCTP_AUTOCLOSE套接字选项。This option allows us to fetch or set the autoclose time for an SCTP endpoint. The autoclose time is the number of seconds an SCTP association will remain open when idle.即SCTP association能够保持空闲状态的最长时间。超时就会被关闭。

人类中心主义未必导致环境的破坏.自然辩证法作业

人类中心主义是一种价值尺度。它主张把人类的利益作为价值原点和道德评价的依据，有且只有人类才是价值判断的主体。作为这个星球上唯一具备高级的情感，复杂的逻辑思维，有复杂的语言系统等等一般生物所不具备的特征的生物。人类明显区别于其它生物种类。人类要生存与发展，不断实现人类价值的提升突破，当然要利用我们生存的环境中的其它生物和自然物。

有一种观点认为：人类中心主义将会导致环境的破坏。这种观点实际上并没有真正理解什么是人类中心主义。人类中心主义最重要最核心的一点，它把是否符合人类的利益作为是非评价的标准。按照这样的价值尺度，又依据破坏环境的行为将最终损坏人类的利益，那么，破坏坏境的行为显然就不是人类中心主义所主张的！恰恰相反，人类中心主义为了维护人类的利益，就要毫不含糊地，旗帜鲜明地支持保护环境。

那么为什么会有那么多人把人类中心主义与环境保护对立起来呢？我想，是因为这些人并未真正理解人类中心主义的中心思想。一个重要的原因是一部分人想当然的把人类中心与环境保护对立起来。另一个重要的原因还在于很多伪人类中心主义者为了短期的眼前利益，为了一时一刻一朝一夕的短期发展，不惜牺牲环境。伪人类中心主义者也自称是人类中心主义者，实际上他们的所作所为与真正的人类中心主义者背道而驰。这些自我标榜的人类中心主义者并不是少数，他们在社会上为人类主心主义造成了相当的负面影响，于是影响了公众对人类中心主义的看法。

作为结论，为了从根本上长远地使我们生活的环境有符合人类的利益，有利于人类的生存与发展。我们在判断一件事是否符合人类中心主义的价值尺度时，不能仅仅着眼于眼前和局部利益，而是要站在更高更远的位置着眼于大局，着眼于长期发展。这样，就能同伪人类中心主义者划清界限。

就人类中心主义的现实意义来讲。我国在过去半个多世纪的发展，尤其是近三十年来的发展只是片面的追求短期的GDP指标，忽略了对环境的保护。这样的粗放型发展模式已经难以为继。在目前的中国社会中，因为忽视环境保护而造成的负面影响已经越来越明显，越来越频繁的表现出来。如频繁的自然灾害，如最近媒体集中关注的北京空气质量问题。这样的发展模式最终损害了人类的利益，实际上是不符人类中心主义主张的。
http://ykyi.net zausiu’s blog

新千年伊始胡温政府提出的科学发展观就是为了解决上文提到的偏差。其实际效果却不能说”成效显著”。对于如此宠大的一个十亿人口的大国，环境问题极端重要，每个人都必须意识到掠夺式地破坏坏境最终会损害社会大众的共同利益，也会损害自已的利益。

中大sicily 1426. Phone List 解题报告

1.原题中文大意;

给出一组数字，最多一万个，判断这组数字中是否存在一个数字是其它一个数字的前缀。

2.算法思想及解题用到的主要数据结构;

每读入一个数字字符串，则把该数字串从高位到低位分解成一个个数字字符。然后开始以最高位数字为树的根，第二个数字为根的子孙结点，第三个数字为第二个数字的子孙结点。如果碰到最后一个数字，如果这组数字符合要求的话则最后一个数字字符一定是叶子结点。不断的加入新的数字字符串到树中，如果在加入的过程中使以前的叶子结点变成了非叶子结点，或者新加入的数字字符串的最后一个数字在树中不是一个叶子结点，则报告这组数字字符串不符合要求。反之，则符合要求。

3.逐步求精算法描述（含过程及变量说明）;

每个结点记下关键信息,一个数字字符。再加上一个指向其它信息的指针. 这个指针指向一个node_data_t结构。结构体中放置了一个tag标签，记字是不是一个数字字符串的最后一个数字；以及一个指向STL的set容器，容器中放置了所有子孙结点。如下代码描述。

struct node_t

{

char _ch;

node_data_t* _data;

};

struct node_data_t

{

set<node_t, node_comp> _nodes;

bool _terminal;

};

剩下的逻辑便是分析数字字符串，把每个字符拆出加入树中。

4.程序注释清单

#include <set>

#include <algorithm>

#include <cstring>

#include <cstdlib>

#include <cstdio>



using namespace std;



struct node_comp;

struct node_data_t;



struct node_t

{

char _ch;

node_data_t* _data;

};



struct node_comp  // 给STL的set容器的排序算子。

{

bool operator()(const node_t& left, const node_t& right)const

{

return left._ch < right._ch;

}

};
http://ykyi.net zausiu's blog
struct node_data_t

{

set<node_t, node_comp>  _nodes;

bool _terminal;  // 标记是不是一个电话号码的终点字符了。

};



typedef set<node_t, node_comp> nodes_set;



nodes_set top_nodes;



// 如果已经可以判断电话号码不满足要求就返回false.

// 递归调用该函数。

bool odyssey(nodes_set& nodes, const char* num)

{

if ( 0 == *num )

{

return true;

}

//

node_t node;

node._ch = *num;

node._data = new node_data_t;

node._data->_terminal = !num[1];



pair<nodes_set::iterator, bool> retval = nodes.insert(node);

if ( retval.second )  // 还没有这个数字的结点.新插入的.

{

bool b;

b = odyssey(retval.first->_data->_nodes, num+1);

return b;

}

else  // 已经存在

{

delete node._data;

// 并且已经是之前一个数字串的最后一个数字字符.或者现在这个字符是最后一个.

if ( retval.first->_data->_terminal || !num[1] )

{

return false;

}

else if ( !num[1] )

{

retval.first->_data->_terminal = true;

return true;

}

else

{

retval.first->_data->_terminal = false;

bool b;

b = odyssey(retval.first->_data->_nodes, num+1); // 递归

return b;

}

}

}

// 释放内存.

void free_mem(nodes_set& nodes)

{

for ( nodes_set::const_iterator ite = nodes.begin();

ite != nodes.end();

++ite )

{

free_mem(ite->_data->_nodes);

delete ite->_data;

}

nodes.clear();

}



int main(int argc, char* argv[])

{

int case_count;   // 有多少个case

int phone_num_count;    // 多少个电话霖巴

char phone_num_str[11];   // 存电话霖巴

bool successed;  // 一组数字字符串是否符合要求



scanf("%d", &case_count);

for ( int i = 0; i < case_count; i++ )

{

scanf("%d", &phone_num_count);



successed = true;

for ( int j = 0; j < phone_num_count; j++ )

{

scanf("%s", phone_num_str);



if ( successed )

{

successed = odyssey(top_nodes, phone_num_str);

}

}

if ( successed )

{

printf("YES\n");

}

else

{

printf("NO\n");

}



free_mem(top_nodes);

}



return 0;

}

5.测试数据

用简单的脚本帮助生成测试数据.

#!/bin/bash

INDEX=0

if [ -e $1 ];then

COUNT=200

else

COUNT=$1

while [

INDEX -lt

COUNT ];do

let "INDEX=$INDEX+1"

echo $RANDOM

done

6.对时间复杂度，空间复杂度方面的分析.

本算法的时间复杂度的空间复杂度一般都介于O(logN)和O(N)之间。

//////////////// 原题 //////////////

1426. Phone List

Description

Given a list of phone numbers, determine if it is consistent in the sense that no number is the pre?x of another. Let’s say the phone catalogue listed these numbers:

? Emergency 911

? Alice 97 625 999

? Bob 91 12 54 26

In this case, it’s not possible to call Bob, because the central would direct your call to the emergency line as soon as you had dialled the ?rst three digits of Bob’s phone number. So this list would not be consistent.

Input

The ?rst line of input gives a single integer, 1 ≤ t ≤ 40, the number of test cases. Each test case starts with n, the number of phone numbers, on a separate line, 1 ≤ n ≤ 10000. Then follows n lines with one unique phone number on each line. A phone number is a sequence of at most ten digits.

Output

For each test case, output “YES” if the list is consistent, or “NO” otherwise.

Sample Input

911

97625999

91125426

113

12340

123440

12345

98346

Sample Output

YES