Final Project Guidelines
For the final project, you should write a brief paper (two to four pages in Physical Review format) on a topic of your choice at the interface of statistical physics and biology. It should be formatted as a regular article, with a title, abstract, and bibliography. The main text should contain introductory and concluding paragraphs (whether or not they appear as separate subsections is not important). The ideal project will combine a literature review, discussion of an analytical or computational model, and application to or analysis of biological data.
Students may collaborate in groups, provided that the respective contributions of each author of the joint paper are clearly specified in a footnote. (The paper may be proportionately longer in such collaborations.) Clearly, the initial hurdle is coming up with an interesting problem that is doable in a short time. We would thus like you to think about potential projects and consult with Prof. Kardar or Prof. Mirny regarding their suitability (preferably as soon as possible, but no later than one day after session #16).
Suggested Projects
Here are some ideas for the final project; they still need to be made more concrete.
1. Quantify the Relevance of Hydrophobicity in Protein Structure: The general expectation is that hydrophobic amino acids are in the core of proteins, while polar amino acids are on the surface. Using databases of protein structures, it is possible to construct histograms of hydrophobicity as a function of the distance from the center of the protein. Do such plots show universal properties? Are there characteristics that can distinguish between "super-families" or folds? Do membrane proteins exhibit a different character?
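As a starting point for the histogram described in project 1, here is a minimal sketch in Python with NumPy. It assumes one position per residue (for example the C-alpha atom), uses the Kyte-Doolittle hydropathy scale as one possible hydrophobicity measure, and scales distances by the radius of gyration so that proteins of different sizes can be compared. The function name `radial_hydrophobicity` and these choices are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale (an assumed choice; other scales could be substituted).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def radial_hydrophobicity(coords, sequence, n_bins=5):
    """Mean hydrophobicity in shells of scaled distance from the centroid.

    coords:   (N, 3) array, one position per residue.
    sequence: string of N one-letter amino acid codes.
    Returns (bin_edges, mean_hydrophobicity_per_bin, residue_count_per_bin).
    """
    coords = np.asarray(coords, dtype=float)
    h = np.array([KYTE_DOOLITTLE[a] for a in sequence])
    # Distance of each residue from the geometric center of the structure.
    d = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    # Scale by the radius of gyration (assumed normalization).
    scaled = d / np.sqrt(np.mean(d ** 2))
    edges = np.linspace(0.0, scaled.max(), n_bins + 1)
    # Assign each residue to a shell; the outermost residue goes in the last bin.
    idx = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=h, minlength=n_bins)
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return edges, means, counts
```

Reading coordinates and sequences from actual structure files is left out; averaging the per-shell means over many structures, grouped by fold or by membrane versus soluble proteins, would be one way to address the questions posed above.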
2. Quantify the Similarity between Protein Structures: Given two protein structures, can we determine how similar or different they are? To answer this question, you need to construct an algorithm that finds an optimal superposition of the two structures. One potential method is to minimize a "distance"

[equation missing from source]

where the sum runs over all pairs of amino acids in the two proteins (of lengths $I$ and $J$, respectively). The vector $\vec{r}_i$ is a three-dimensional "location" assigned to the $i$th amino acid in the structure, and $f(r)$ is a short-ranged function that decays rapidly for separations larger than some distance $R$. Study the feasibility of this scheme and the optimal choice of $f(r)$. As a minimum requirement, the algorithm should be able to align and superpose identical structures.
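The following is a minimal sketch of one possible scoring function for project 2. It assumes the form $D = -\sum_{i,j} f(|\vec{r}_i - \vec{r}\,'_j|)$ with a Gaussian $f$ of width `radius`; both the sign convention and the Gaussian are assumptions made for illustration, as are all function names. The search is restricted to rotations about the z axis of a toy helix, just to check that identical structures are superposed; a real implementation would search over full rotations and translations.

```python
import numpy as np

def overlap_distance(r_a, r_b, radius=1.0):
    """Assumed score: D = -sum_{i,j} exp(-|r_a[i] - r_b[j]|^2 / (2 radius^2))."""
    r_a = np.asarray(r_a, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    diff = r_a[:, None, :] - r_b[None, :, :]      # all I x J separation vectors
    sq = np.sum(diff ** 2, axis=-1)
    return -np.sum(np.exp(-sq / (2.0 * radius ** 2)))

def rotate_z(coords, angle):
    """Rotate an (N, 3) array of points about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return np.asarray(coords, dtype=float) @ rot.T

def toy_helix(n=10, helix_radius=5.0, rise=3.0):
    """Hypothetical test structure: n points on a helix around the z axis."""
    i = np.arange(n)
    return np.column_stack([helix_radius * np.cos(i),
                            helix_radius * np.sin(i),
                            rise * i])

def best_z_rotation(r_a, r_b, radius=1.0, n_angles=360):
    """Brute-force scan over z rotations of r_b; returns (angle, score) at the minimum."""
    angles = np.linspace(-np.pi, np.pi, n_angles, endpoint=False)
    scores = [overlap_distance(r_a, rotate_z(r_b, a), radius) for a in angles]
    k = int(np.argmin(scores))
    return angles[k], scores[k]

helix = toy_helix()
angle, score = best_z_rotation(helix, rotate_z(helix, 0.5))
print(angle, score)  # the recovered angle should be close to -0.5
```

Because the score is a sum over all pairs, no residue-to-residue correspondence has to be supplied in advance. The price is a rugged landscape with many local minima, which is one aspect of the feasibility question.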
3. Sliding Double Strands: The Poland-Scheraga model treats all bases of DNA as equivalent, yet assumes that the $i$th base on one strand can only bind to the $i$th base on the complementary strand. However, if all bases are equivalent, it should be possible to "slide" the two strands with respect to each other, creating single-stranded segments at the two ends. It should also be possible for the bubbles to have unequal numbers of monomers from each strand. Generalize the Poland-Scheraga model to allow for these possibilities.
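For orientation, the restricted model that project 3 asks you to generalize can be evaluated with a simple recursion. The sketch below is written under assumed conventions that are not taken from the course notes: each bound pair carries a weight `w`, a bubble with `k` unbound bases on each strand carries a weight `loop_cost * s**(2k) / (2k)**c`, and the first and last pairs are bound so there are no free ends.

```python
def poland_scheraga_z(n_pairs, w, s=1.0, c=2.1, loop_cost=1.0):
    """Partition function for n_pairs complementary pairs, first and last pair bound.

    Assumed weights: w per bound pair; loop_cost * s**L / L**c for a bubble
    containing L = 2k monomers (k from each strand).
    """
    # z_bound[n]: total weight of configurations of pairs 1..n with pairs 1 and n bound.
    z_bound = [0.0] * (n_pairs + 1)
    z_bound[1] = w
    for n in range(2, n_pairs + 1):
        total = z_bound[n - 1]                 # pair n-1 is bound too: no bubble
        for m in range(1, n - 1):              # m = last bound pair before n
            loop_len = 2 * (n - m - 1)         # equal contribution from both strands
            total += z_bound[m] * loop_cost * s ** loop_len / loop_len ** c
        z_bound[n] = w * total
    return z_bound[n_pairs]

print(poland_scheraga_z(3, 2.0, s=1.0, c=1.0))  # 2 * (4 + 2 * 0.5) = 10.0
```

Allowing sliding and asymmetric bubbles would presumably require tracking the position on each strand separately, for example a table indexed by the last bound base `i` on one strand and `j` on the other, with bubble lengths `k1 + k2` and dangling ends of arbitrary length.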
4. Energetics of Mutations: It is possible to mutate a single amino acid in a protein and quantify its effect on the stability of the protein by experimentally measuring the change in Gibbs free energy, $\Delta\Delta G$.
Use a simple pairwise energy function $E = \sum_{ij} \delta_{ij} U(a_i, a_j)$ to approximate the free energy of a protein. Here $\delta_{ij} = 1$ if amino acids $i$ and $j$ are closer than a certain cutoff (and 0 otherwise), $a_i$ is the amino acid at position $i$ of the sequence, and $U(x, y)$ is the interaction energy between amino acid types $x$ and $y$. The interaction potential $U(x, y)$ is generally unknown.
Using this formalism, you can compute $\Delta E$ for a mutation. Can you derive an interaction potential $U(x, y)$ that provides the best fit between the theoretical $\Delta E$ and the experimentally measured $\Delta\Delta G$ of mutations?
What is the best correlation between theory and experiment that you can get using this model?
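Because $\Delta E$ is linear in the entries of $U(x, y)$, one way to approach the fit in project 4 is ordinary least squares. The sketch below assumes that $U$ is symmetric, that the contact map is unchanged by the mutation, and that the unfolded state can be ignored, so that $\Delta E = \sum_{j \in \text{contacts}(p)} [U(b, a_j) - U(a, a_j)]$ for a mutation $a \to b$ at position $p$. The function names and the toy data are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement

def pair_index(alphabet):
    """Map each unordered pair of amino acid types to a parameter index."""
    pairs = combinations_with_replacement(alphabet, 2)
    return {frozenset(p): k for k, p in enumerate(pairs)}

def mutation_features(sequence, contacts, pos, new_aa, index):
    """Row x such that x @ u equals Delta E for mutating sequence[pos] to new_aa.

    contacts: dict mapping a position to the list of positions in contact with it.
    """
    x = np.zeros(len(index))
    old_aa = sequence[pos]
    for j in contacts[pos]:
        x[index[frozenset((new_aa, sequence[j]))]] += 1.0   # contacts gained
        x[index[frozenset((old_aa, sequence[j]))]] -= 1.0   # contacts lost
    return x

def fit_potential(rows, ddg):
    """Least-squares estimate of the U parameters; returns (u, predicted Delta E)."""
    X = np.vstack(rows)
    u, *_ = np.linalg.lstsq(X, np.asarray(ddg, dtype=float), rcond=None)
    return u, X @ u

# Toy example on a two-letter alphabet with made-up "measurements".
index = pair_index("HP")
seq = "HPHP"
contacts = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
rows = [mutation_features(seq, contacts, 0, "P", index),
        mutation_features(seq, contacts, 1, "H", index)]
u, pred = fit_potential(rows, [1.0, -2.0])
print(pred)
```

Only differences of $U$ entries enter $\Delta E$, so the fitted parameters are not unique; `lstsq` returns the minimum-norm solution. With real data, the correlation between predicted $\Delta E$ and measured $\Delta\Delta G$ should be evaluated on mutations held out from the fit.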
5. Correlated Mutations in Proteins: Take a protein structure and a matching HSSP file with multiple alignments. (To get the HSSP file (multiple alignments) corresponding to a PDB file (protein structure), go to the Protein Data Bank and enter the PDB-ID of a protein, e.g. 1ten. On the protein's web page, click on Other Sources to get a link to the corresponding HSSP file.) By examining the differences between the sequences, search for correlated mutations. Are such correlations primarily between interacting amino acids (which are in close proximity in the chosen structure), or are there "induced" correlations between non-interacting ones? To improve statistics, you will probably need to simplify your alphabet, for example by grouping the amino acids according to hydrophobicity, size, or charge. Try to find a representation that provides maximal discrimination between correlations of interacting and non-interacting amino acids.
6. Pair Correlations in the Genome: In problem 3 of Assignment 3, you examined the dependence of pair correlations on base-pair separation in the genome of E. coli. Expand and refine this problem as follows: Use mutual information as the measure of correlation between bases at a distance $n$. Compare the $n$ dependence of this quantity for coding and non-coding regions of the E. coli genome. Check whether the same correlations are observed in coding and non-coding regions of the human genome.
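The mutual information between bases separated by $n$ positions is $I(n) = \sum_{x,y} p_n(x, y) \log_2 \frac{p_n(x, y)}{p(x)\,p(y)}$, where $p_n(x, y)$ is the joint frequency of base $x$ at position $i$ and base $y$ at position $i + n$. A minimal sketch of a plug-in estimate is given below; the function name is hypothetical, and characters outside the alphabet (such as `N`) are simply skipped, which is an assumed convention.

```python
import numpy as np

def mutual_information(seq, n, alphabet="ACGT"):
    """Plug-in estimate of I(n) in bits between bases n positions apart."""
    idx = {b: k for k, b in enumerate(alphabet)}
    joint = np.zeros((len(alphabet), len(alphabet)))
    # Count all pairs (seq[i], seq[i + n]) whose bases are both in the alphabet.
    for a, b in zip(seq, seq[n:]):
        if a in idx and b in idx:
            joint[idx[a], idx[b]] += 1
    total = joint.sum()
    if total == 0:
        return 0.0
    joint /= total
    px = joint.sum(axis=1, keepdims=True)   # marginal of the first base
    py = joint.sum(axis=0, keepdims=True)   # marginal of the second base
    mask = joint > 0                        # terms with zero probability contribute 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

print(mutual_information("ACGT" * 50, 4))   # perfectly periodic sequence: 2 bits
```

For finite sequences the plug-in estimate is biased upward, so comparing against shuffled versions of the same sequence is one way to judge which values of $I(n)$ are significant. In coding regions, a period-3 modulation in $n$ is a natural feature to look for.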