Chapter 17: The Challenges of Big Data; Developing "Clinical Trial Grade" Data to Foster Data Use
In this chapter, Dr. Mills talks about some of the practical and intellectual challenges that big data poses for research. He begins by explaining that it is a big challenge to create mechanisms and funding simply to share massive quantities of data (most of it generated from patient samples). He notes the costs of not sharing data. Dr. Mills states that this amount of information "has unbelievable promise," but it has to be linked to new strategies of collecting data in clinical trials and shared in "curated" formats in order to be meaningful. He introduces the concept of "clinical trial grade" to describe trials specifically designed to leverage their data. Dr. Mills also talks about the energy wasted in repeating work, giving the example of research studies that require documentation of cell lines whose pedigree is already proven. He talks about the idea of "curating" data. Next, Dr. Mills gives specific examples of strategies for data collection and use. He talks about his own research group's activities developing their own cell lines with a known pedigree. He then talks about the institution-wide effort to sequence patient tumors and discusses what is involved in putting this information into a usable format, noting that MD Anderson currently has a program in place to accomplish this. Next he discusses why the patient information currently held at MD Anderson is not at clinical trial grade (noting that this has nothing to do with quality of care, but only with techniques of data collection). He explains that Google wants to create its own hospital to guarantee the strictest quality of data collection.
The Historical Resources Center, The Research Medical Library, The University of Texas MD Anderson Cancer Center
The Interview Subject's Story - Overview; The Researcher; MD Anderson Culture; On Research and Researchers; Understanding Cancer, the History of Science, Cancer Research; Ethics; Definitions, Explanations, Translations; Overview; MD Anderson Culture; MD Anderson History; Research; Institutional Processes; Devices, Drugs, Procedures; The Institution and Finances
Gordon B. Mills, MD, PhD :Now the other question of sharing of data and information is one that is a major challenge. The amount of information that we are generating today, in terms of DNA sequencing, RNA sequence, RNA copy number, just generally masses of data across samples, across conditions, is such that probably each month, maybe each two or three months, we exceed the total amount of data that had been generated by all scientists over all time.
Tacey Ann Rosolowski, PhD:Oh my gosh.
Gordon B. Mills, MD, PhD :So this type of data --and I'm careful to use the term data, it is not information, it is not knowledge, it is just simply data-- and that requires then, a totally new way to think about how we move that forward. That data has value and indeed, there are a number of institutions that look upon the data that we generate at this level, as a potential for monetization. Can it be used with industry or with other approaches, to generate money that can then be put back into the system, to improve what is happening, or should it simply be shared with the community broadly, under two or three sort of justifications? The first one is, is that the information comes from a very limited number of sources. One of them is, is most of it is from patient samples, and so we have a covenant with our patients who provided those samples, to help cancer research, to improve outcomes for them and for the next generation of people. It was paid for by tax dollars most of the time, frequently by philanthropy being part of it, and both of those have the goal of improving outcomes in the future. So you have this question of do I use this information to generate funds that will help accomplish that goal, the monetization concept, or do I simply put it out for the world to work on, with the assumption that that data should be shared and open? It is a major challenge. There is this two different concepts, both of which have value. You can certainly tell, from the tone and the terms I use, that I believe this should be shared, and that this is a strong point of view. The other part that comes with this is sharing is hard, and so data has to be carefully curated, put in formats, put in ways that can be shared so that mistakes aren't made in the interpretation of it, and there is very little money available to make that happen. So it is not straightforward, of saying here is the sequence data that I've generated on a thousand patients, and just go at it. 
That isn't going to work, and so we need to figure out this whole concept of how you support this process, how you build the culture so that you can use the concept, and how you make it move forward. One other argument that I have been pushing and leading, along with many others, I'm not implying it's me specifically, is the idea that this incredible amount of information that we can generate has an unbelievable promise to transform how we move ahead, this idea of big data, big data sharing. However, the one great limiting factor in my mind is the fact that this data will not, or this approach, will not fulfill its promise unless it is linked to high quality clinical outcomes data, and I use the term "of clinical trial grade," meaning that imaging is taken at appropriate times, that the images are properly done and evaluated. The patient is followed throughout their career. We know what they are treated with, when and how, and all of that information is then integrated with the deluge of information that we are generating around the genomics of that tumor, et cetera. Now, the other challenge that comes up with our lack of sharing is unacceptable redundancy and cost, and a good example is that there are approximately two thousand cell lines in existence, and I would say that most research is done on three hundred of them. Those three hundred cell lines, in hundreds of different labs, have been characterized for proteomics, invasion, proliferation, for genomic sequence, and that has been done because one has to do it, but frankly, the cost of that and doing it over and over again, in different laboratories, is unacceptable. So, the failure to share just the sequence data on the cell lines, and other resources in a similar manner, results in this challenge of having to redo things in different labs, with basically funds from the public. And so there has to be a better way to figure out how to share all of this information.
Tacey Ann Rosolowski, PhD:Now, when you're talking, when you gave that example of the sharing information about the cell lines, were you referring to individuals within MD Anderson, who are repeating that, or in general, in the community of biomedical sciences?
Gordon B. Mills, MD, PhD :Yes to both.
Tacey Ann Rosolowski, PhD:To both, yeah.
Gordon B. Mills, MD, PhD :To both. And some of this is new and emerging technologies that require things being done differently, so it's not strictly saying everything done on cell lines is redundant, but rather, some of the very hardcore, unchanging events in a cell, such as its DNA sequence, which changes a bit but very little. Its DNA sequence could be done and then released and that's it, or it could be required that it be done, released and that's it, and put in a format that the world can use. Now there are groups that are doing this and are doing a wonderful job of it. The Broad Institute has developed the Cancer Cell Line Encyclopedia. COSMIC has a similar effort. We have released an enormous amount of data, but the idea that most of the time it's not known and not coordinated and not easily found, has been a problem.
Tacey Ann Rosolowski, PhD:Are there any individuals who object to such an idea on intellectual grounds, I mean feeling that it's absolutely essential that all those activities be within a single study method?
Gordon B. Mills, MD, PhD :I think in some cases, this is required by reviewers of journals, to say did you check your variant of this cell line, and I think in some cases that's true. I think that the general information, if done properly, would be useful, but the biggest problem, if I run an experiment that I want to put into a paper, it gets constrained, it gets done by a postdoc, and that data then sits on the postdoc's desk. For me to take that data and then make it available to the world in detail, in a rapid manner, outside of a paper, is something that is not funded and supported, and we do not have, in general, the resources to do this.
Tacey Ann Rosolowski, PhD:This may be an inappropriately specific question, so you can tell me, but I'm curious, what are the kind of activities that are required to take this raw data and curate it, is the word you used.
Gordon B. Mills, MD, PhD :There are many different steps. You start off, and let's just use cell lines as an example. The first thing is, is the cell line really what you think it is? The community now knows, from studies that have been done in many centers for two generations, that many of the cell lines that we use are mislabeled. They have been mixed up over time. They have either been mislabeled because you write the name down and you transpose two letters over time, or they've been mixed up because someone picks up one vial and another vial and mislabels it, and also because of care of use in that some cell lines grow very fast and if you get one cell into that preparation, it will outgrow. Other times, the history of the cell line is very poorly annotated and it will be labeled as an ovarian cancer cell line when it's a cervix line, or a melanoma line. So just even that step, of curating and knowing what you're working with, is massive. The second is, is as you generate data where do you put it, where does it go? How do you quality control it? An example is that I trust sequencing from some centers and I trust data from some centers, and frankly, I don't trust it from others. We do things differently. I happen to trust the way we do things, not necessarily that it's better but it's what I'm used to. We have approaches that I think we've set up to be high quality control, but we know there are errors in the data out there. Joe Gray, who I mentioned a few moments ago, and myself, have talked about writing a paper on the problems of sequence data available on cell lines, because we went in, doing a particular study, and the correlation across the major databases for even well-known, well-characterized mutations, was very low, sufficiently low, that clearly, errors had crept in from different places. 
Sometimes it's simply how they annotate sequence, sometimes it was mixed up cell lines as I talked about, and sometimes it's older technology that has now been superseded, and you don't know exactly what was done in another group. So all of those have resulted in some concerns about how you manage this information. If we take the cell line question, the community, with support from NCI and other places and our own group, have now started to develop new cell lines, where we have a much better pedigree of the cell line, we know what patient it came from. We have better annotation and now we have technology that allows us to know whether this is the tumor that we started with, has it changed over time, things that we could not do ten, fifteen, twenty years ago. So all of these things are happening, but this is just one example of the microcosm of how you do this. Another example. There are major sequencing efforts of patient tumors across many different centers. MD Anderson has done sequencing data on twenty thousand tumors, give or take, but with different technologies, different approaches, different quality control, clear questions around how mutations were called, how they changed over time. And so if we were even just to put that data, or try and put that data, into a usable format, it is a massive effort. But even more importantly then, linking that to high quality outcomes that I mentioned earlier, again, is an effort beyond the capability of any group. It would be an institutional process, and indeed we are trying to do that, but the effort required to curate, quality control, remove bias, make sure everything is accurate, not mixed up, is massive and needs to be supported. Now, we have a major effort at MD Anderson to put this in place, and it is a critically important process for the future. 
I will argue that we have not put as much effort, as I mentioned earlier, into attaining the high quality clinical trial grade outcomes that are necessary, partly because we don't have the resources, partly because that's not how we manage patients, partly because we thought we could get this from the patient record, but I think more and more studies are showing that our current patient records are inadequate in terms of clinical trial grade data. So, we're having to rethink a lot of how to make this whole process valuable.
Tacey Ann Rosolowski, PhD:What's required to improve the -- to bring the grade of patient information up to what is appropriate for clinical trials? [00:31:38]
Gordon B. Mills, MD, PhD :I think there's two things that are required. One is culture and the second is money, and will. Culture and will here, I think go together. But the culture is, is that we manage patients inside of a clinical trial in a very constrained, very rigorous, very expensive process, and that patients that we are caring for in our normal care get outstanding care. This has nothing to do with quality of care. But our annotation and our follow-up, of every step in that patient's career as a patient, or lifespan as a patient, whatever term you would like to use, is one that we don't do in the same rigorous manner. It is a time consuming and expensive process and so it would be a culture shift. And when I talk about expensive, I don't mean just dollars to do the imaging at the right time, but you're going to have to free-up physician time, nurse time, to do that extra work. It is expensive in time and dollars, to attain that type of information, but again, it is absolutely necessary if we are going to truly fulfill the promise of the data that we are generating.
Tacey Ann Rosolowski, PhD:Who's talking about this now?
Gordon B. Mills, MD, PhD :Lots of people. There are a number of people who talk loudly. I've been at several big data conferences, of the people who work on the big data, and have espoused that particular point of view. Many other people understand this, are putting the same thing forward. I actually recently heard a presentation from Google, which are one of the very big data analytic groups in the world, the medical branch of that is called Google X or Verily, and they have actually discussed the potential of building or buying, their own hospital, so that they can simply say, we are going to get this type of data, because we have found that the data that we get from the community does not have the depth of information, the granularity that we need to make progress. And to hear an analytics group, a pure information group, saying we know that this is the rate limiting problem, says that I think we're having an impact. I was asked recently, to participate in a debate at one of the major meetings in Europe, simply about this question of what it is that we need to do to benefit from the information that we are generating. So there is an understanding, and so please don't in any way think I'm leading this or I'm the only person.
Tacey Ann Rosolowski, PhD:Sure.
Gordon B. Mills, MD, PhD :This is a major discussion. The challenge is how are we going to pay for it.
Tacey Ann Rosolowski, PhD:How to make it happen, yeah.
Gordon B. Mills, MD, PhD :And how are we going to cause the culture shift. This again, fits into this concept of team science and how do you fund it, how do you support it. How do you make these big efforts work?
Mills, Gordon B. M.D., Ph.D. and Rosolowski, Tacey A. Ph.D., "Chapter 17: The Challenges of Big Data; Developing 'Clinical Trial Grade' Data to Foster Data Use" (2016). Interview Chapters. 76.