Adventures of a Computational Scientist in the King Lab

Russell Schwartz
Professor and Head, Computational Biology Department
Professor, Department of Biological Sciences
Carnegie Mellon University
4400 Fifth Avenue
Pittsburgh, PA 15213 USA
russells@andrew.cmu.edu

Introduction

Although I was not in Jonathan King’s group for too long, it was an influential time for me and I am proud to be counted among the lab’s alumni. I was probably an oddity among the King Lab alumni, in that I did not do any bench lab work in my time there. I am a computational biologist and I went there to do computational work while embedded in a wet lab. With the exception of Jon’s lab coordinator, Cindy Woolley, everyone else in the lab at the time was a bench scientist first: Patricia Clark, Cammie Haase-Pettingale, Ajay Pande, Steve Raso, Claire Ting and, of course, Jon himself. When I joined the lab, I had just finished my Ph.D. in Computer Science at MIT, working in computational biophysics related to phage assembly. I went to the King Lab to pursue a different kind of computational biology, though, moving from biophysical modeling to bioinformatics. I was not formally trained as a bioinformaticist, although hardly anyone was at the time, and Jon was not an obvious mentor for a bioinformaticist. This was well before MIT had a Course VI-7 (Computer Science and Molecular Biology) major, so I had largely designed my own majors in computational biology by adding what I thought I needed on top of MIT computer science degrees. I had taken what at the time was the only computational biology class at MIT, taught by Bonnie Berger (my thesis advisor) and David Gifford, and had done a little sequence analysis work with Bonnie although it was not my main area.

I was not a complete stranger to the King Lab and their work. Jon and Bonnie had worked together on capsid assembly, which became the focus of my graduate studies, and I had gotten to know a number of people in the phage assembly community, many of them Jon’s academic descendants. I had also previously collaborated with Jon remotely, in a summer internship with Sorin Istrail (then of Sandia National Labs, now of Brown University) using highly simplified computer simulations of protein aggregation to look at sequence properties that might influence a protein’s ability to avoid aggregation during folding. That had led to a paper with Jon and Sorin (Istrail et al., 1999) that inspired some of the questions I planned to pursue as a postdoc in Jon’s lab.

It was also not my first time in an experimental biology lab. I had spent a summer in the lab of another of Jon’s alumni, Peter Prevelige, learning to do some experimental phage work. That was also a great time and a wonderful learning experience, although quite different from my plans in the King Lab. I came to the Prevelige Lab to experience wet lab work first-hand, while I went to the King Lab with a somewhat ill-formed plan to do computational work while embedded in an experimental lab.

Still, it was an unusual setup and one I am grateful Jon was willing to consider. Today, it might not be so strange for an experimental lab to have an embedded computational scientist. It was quite exceptional at the time, though, particularly for a lab that was not working with any of the “big data” technologies of the era. I had some big gaps in my knowledge and experience relative to the rest of the lab. (I recall once taking a phone message for Steve Raso from Carl’s Ice, who I was guessing must be the lab’s dry ice supplier.) I think, though, that Jon and I both appreciated in different ways that a new kind of biological science was becoming possible through the beginnings of what we would now call data science coinciding with the pioneering work of many individuals to build publicly available repositories of biological data, which were just then starting a period of rapid growth that has continued to the present day.

I do not know if Jon expected to get anything useful out of me being there, aside from doing a favor for me and Bonnie. For me, though, it seemed in part a chance to follow up on some of the directions Sorin and I had started at Sandia. More than that, I thought it would be a chance to learn more biology from Jon and from the rest of the lab and to figure out how a computational biologist can best contribute to experimental biology and develop more of the perspective of a bench scientist on the topics that I was studying through simulations and data analysis.

Computational Science in the King Lab

The actual science I worked on took a few turns even in the short time I was there. We had started the work with some hypotheses about protein folding, aggregation, and their influence on sequence evolution, inspired by simulation studies I had done with Sorin at Sandia. Those hypotheses would likely be impossible to study experimentally, but we had an idea of how they might be approached computationally by looking at properties of the proteome as a whole, reflected in databases of protein sequences. At the time, no one lab had anything approaching the scope of data to ask a question like that, but sequence databases that were begun long before had started to take off and it was becoming possible to ask questions about the space of protein sequences as a whole.

Much of the excitement for me came in sitting down with Jon to discuss interpretation of results and how they would lead to the next set of questions. Jon and I would formulate hypotheses about protein biophysics and evolution and I would go off to try to translate them into statistical questions on sequence databases, write code to do the analysis, and then see what we found. For example, one early project started with a result from our coarse-grained simulations, suggesting that particular ways of distributing amino acids along a protein chain ought to, on average, help protect proteins from aggregation during folding. Given the simplicity of our models and all of the other selective pressures acting on a protein, we knew there would be at best a very weak signal to be found in real proteins. But if there were a real effect then it should show up as a measurable bias in a large enough data set. So we would look at the space of all the then-sequenced proteins and, if we were lucky, see the bias we were looking for on top of a variety of other unanticipated results. Then I would go back to sit down with Jon or other lab mates, notably Claire Ting, and tell them what I was seeing in the numerical data, and talk about what it could mean about protein biophysics. Sometimes what seemed interesting was just an artifact we could only appreciate with Jon’s depth of knowledge: for example, a chance effect because some kinds of sequences were heavily over- or under-represented in the sequence databases. Other times, it was something truly unusual and unexpected that Jon might be able to connect to some aspect of protein structure or folding that was completely unknown to me. Although I would not have described Jon as a data scientist at the time, he did not shy away from thinking and talking about data. When I talked to him about statistical properties of sequences, he understood what the numbers said about the sequences and, more than that, could connect it to a deep insight into what that in turn might mean about the physics of actual molecules in solution.
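To make the flavor of such a question concrete, here is a minimal sketch of how one might ask whether long runs of consecutive hydrophobic residues are rarer in a set of sequences than chance would predict. It is an illustration under stated assumptions, not the analysis used in the papers cited here: the hydrophobic residue set, the shuffling null model, the function names, and the toy sequences are all hypothetical choices made for this example.

```python
import random

# Assumption for illustration only: one simple choice of "hydrophobic"
# residues. Other classifications are possible and may give other results.
HYDROPHOBIC = set("AVLIMFWC")


def run_lengths(seq, hydrophobic=HYDROPHOBIC):
    """Return the lengths of maximal runs of consecutive hydrophobic residues."""
    runs = []
    current = 0
    for residue in seq:
        if residue in hydrophobic:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def count_long_runs(seqs, min_len):
    """Count hydrophobic runs of at least min_len residues across all sequences."""
    return sum(1 for s in seqs for r in run_lengths(s) if r >= min_len)


def shuffled_expectation(seqs, min_len, trials=200, seed=0):
    """Average count of long runs after shuffling residues within each sequence.

    Shuffling keeps each sequence's composition but destroys residue order,
    giving a simple null model to compare the observed count against.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        shuffled = []
        for s in seqs:
            residues = list(s)
            rng.shuffle(residues)
            shuffled.append("".join(residues))
        total += count_long_runs(shuffled, min_len)
    return total / trials


if __name__ == "__main__":
    # Made-up toy sequences, not real database entries.
    toy = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSHMLEDPAVKLIVAG"]
    observed = count_long_runs(toy, min_len=3)
    expected = shuffled_expectation(toy, min_len=3)
    print(f"observed long runs: {observed}, shuffled mean: {expected:.2f}")
```

On a real database, an observed count well below the shuffled mean would hint at suppression of long hydrophobic blocks. As noted above, though, database composition biases can produce similar-looking artifacts, so any such signal would need careful scrutiny.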

For me, it felt like a Wild West of sequence analysis: databases of sequences just sitting on the internet on which many of the simplest questions had never been asked. Jon, Claire, and I could discuss some idea, I would figure out how to phrase it as a quantitative or statistical question, write some code, and let it run. The databases then were in a sweet spot for this kind of work: big enough to be interesting but not so big that we could not download all the protein sequences then known and write some code that could pose basic questions about them on a standard desktop computer in minutes. My experiments did not cost anything, use up any reagents, or take much more time than was needed to figure out what we wanted to ask and how to ask it, so it was a rare opportunity to try many ideas and just see where they went. We got a few papers out of this work: one from the question we actually set out to study (Schwartz et al., 2001a), another from just an odd observation that we decided was interesting (Schwartz et al., 2001b), and a follow-up paper a few years later (Schwartz and King, 2006).

Learning to be a Scientist

For me at least, the lessons I took from the experience on how to be a scientist were of much greater impact than any scientific findings we might have stumbled on. What I took away from my time in the King Lab was probably fairly different from that of other lab alumni because I was learning how to be a computational biologist. One perhaps obvious thing I learned to appreciate better is that you cannot do good computational biology without knowing a lot of biology. Of course, Jon has vast knowledge of and insight into protein biophysics and I would like to think at least some of that rubbed off on me from talking to him and to others in the lab. At the same time, though, I learned that the biology a computational biologist needs to know is not necessarily the same as what an experimental biologist needs to know. There were many basic things my labmates no doubt knew that I did not and probably never will know. But I also learned that there was a good deal I needed to know about biology and biotechnology that most of my labmates might not have known. When trying to reason about why databases of protein structures showed this or that bias, it could become very important to know where those databases came from: how one learned a protein structure, which kinds of proteins might be easier to solve and which harder, how the sequences in the database might differ from their wildtype versions, or why there might be many entries for one family of proteins but few or none for another. Jon knew a lot of these answers and, more than that, could appreciate why they were important, but I expect a lot of what he knew would be considered obscure trivia even by most experienced protein biophysicists.

I also learned that a good computational biologist needs to know a lot of computer science and again not necessarily the same computer science a computational expert in another domain might need to know. It was really in the King Lab that I began to think about problems statistically, something that I did not learn about at all in my formal studies but can now appreciate is so central to computational biology. Today, I would expand that lesson to encompass machine learning, which is most of what I and many other computational biologists actually do but was a pretty specialized spinoff of artificial intelligence then. There is much more to it than that — specialized disciplines of computing pulled from applied math, computational physics, statistical computing, chemical engineering, operations research, and a host of other areas — some of which I learned before coming to the King Lab and some in the years since. Much of it one only figures out, though, by getting stuck on an interesting problem and going hunting for a way to get unstuck.

More than any specific facts or disciplines, though, I think what I learned from Jon is that scientific knowledge does not come from deliberate study so much as from the love of pursuing it. I do not think you get to the kind of encyclopedic knowledge Jon has because you set out to do so, but because you followed so many scientific threads out of a love of seeing where they would go and listened attentively to so many other people’s stories because you wanted to know where their scientific journeys led them. I could learn a lot from Jon because I was excited by the questions we were trying to answer and I was lucky that this led to me incidentally learning some new tricks along the way. Although my scientific interests have gone pretty far afield of what I was studying with Jon then, I would like to think that insofar as I have been successful as a scientist it was because of this: not out of any plan for what I should work on but just from following interesting problems where they led me and ending up now and then stumbling onto something significant.

Learning to be an Educator

Being in the King Lab also helped inspire another major theme of my work in the years since: education, and particularly quantitative and computational education in the life sciences. As a student trying to become a computational biologist without any formal degree program, I often found myself asking what I would need to know to be an effective computational biologist. Experiences like my time in Jon’s lab helped me fill in some of those pieces. It is a question one never stops asking in research, but for an academic it becomes a question one ends up asking mainly about trainees. What does a student being trained today need to know to be an effective computational biologist? What would they need to know to be able to do my job? Or to be able to do any of the other jobs that call for a computational biologist today? Jon’s lab was for me a place where I could really start to think on a day-to-day basis about what a computational expert can add to a largely experimental interdisciplinary team and to build my own knowledge of what I found useful, which years later would inform my own work on course and curriculum development.

Technologies, and even whole areas of research, come and go, but I think the most important lessons transcend these passing trends. What I feel I most needed to learn from Jon, and have since strived to pass on to my own students, is some grasp on how a scientist like him thinks about a problem. Maybe it is easier to appreciate for an interdisciplinary scientist like myself, but different kinds of scientists think very differently about problems. It is not just a matter of knowing different facts or techniques, but of having a different aesthetic for what an interesting question or an interesting answer looks like, how to pose and solve a problem, or what evidence is considered persuasive that it is solved. These viewpoints are very different for a typical experimental biologist versus a computer scientist, for example, or for that matter a physicist, an engineer, or a mathematician. A computational biologist needs to be able to communicate with people in all of those groups, not just to share facts and data but to understand what would be interesting and exciting to them in a research direction. Talking to colleagues in the King Lab was a great way to develop that appreciation myself. And Jon himself was a great mentor for working through some of these issues because he was patient and insightful, could bridge those divides in thinking himself, and could help me see ways across them. This is a lesson I still try to share with my own students, however imperfectly, for example in teaching them how to give a scientific talk.

The other side of my thinking about education that my time with Jon helped inspire was asking what someone like me could teach experimental biologists. There is plenty I learned getting degrees in computer science that would be useless for the average lab scientist, and a great deal that would probably always be best left to computational specialists. But I could appreciate even at the time that a lot of what I brought to the lab were skills that my experimental labmates could have learned to do themselves and that, I believe, could have been very valuable to them as bench scientists. There is nothing too complicated about the main ideas behind statistical reasoning, writing a simple computer program, or putting a thought experiment into the language of mathematical models so one can reason through it more rigorously. They are things one needs to learn, though, and were not (and still largely are not) part of how experimental biologists are trained.

Talking to Jon and my fellow lab members about their science and where I could see possibilities for computational approaches to enhance it was a great opportunity to think about what I knew as a computational scientist that would be useful and accessible to an aspiring bench scientist. I did not have the time in the King Lab to try to develop any of these ideas as an educator, but I have since had plenty of opportunity to think about them in my own work designing courses and degree programs and in my efforts to work with the broader bioinformatics education community on curriculum reform. My experience with Jon was an important influence in my own decision to go to a biological sciences department rather than a computer science department when I started my own independent academic career, which has since given me the chance to design courses to introduce bench scientists at various levels to computational biology. I still believe that there is far more that experimental life scientists can and should learn about computational, statistical, and mathematical tools if they are going to be the best experimental life scientists they can be. I do not have all of the answers for how to get there, which is arguably a harder but more impactful challenge than any of the scientific questions I have ever worked on. But maybe one last lesson I was able to glean from Jon is that sometimes one needs to fight for an important cause, even if success might seem a long way away.

Conclusions

The short time I spent in Jon’s lab was a wonderful and formative experience. I wish I still had the opportunity to recommend Jon’s mentorship to others, but he and the environment of his lab cannot be duplicated. Still, an aspiring scientist might find similar opportunities with other mentors to embed themselves in an environment distant from their own training. With the right lab, it is an experience I would highly recommend to any other aspiring interdisciplinary scientists. Working as a computational scientist in an experimental lab helped me appreciate how much we can learn from people whose knowledge and experiences are very different from our own, even if those differences might at first seem like an obstacle. I do not think an experience like I had in the King Lab would work out for everyone or in every situation. Ultimately, though, I believe it was a great experience for me because Jon was an ideal mentor for it: a broad thinker and a great scientist, but also a good person who was open to being a patient guide to an outsider like me.

Bibliography

Istrail, S., Schwartz, R., & King, J. (1999). Lattice simulations of aggregation funnels for protein folding. Journal of Computational Biology, 6(2), 143-162.

Schwartz, R., Istrail, S., & King, J. (2001a). Frequencies of amino acid strings in globular protein sequences indicate suppression of blocks of consecutive hydrophobic residues. Protein Science, 10(5), 1023-1031.

Schwartz, R., & King, J. (2006). Frequencies of hydrophobic and hydrophilic runs and alternations in proteins of known structure. Protein Science, 15(1), 102-112.

Schwartz, R., Ting, C. S., & King, J. (2001b). Whole proteome pI values correlate with subcellular localizations of proteins for organisms within the three domains of life. Genome Research, 11(5), 703-709.