Sharing proteomics data, trickier than it seems

Reading the blog post from Cameron Neylon about how research incentives should align with research outcomes made me see a clear relation with the problem of sharing data in proteomics.

Accessing genomics or transcriptomics data is more or less straightforward compared to proteomics data. You go to a data repository and download whatever you are looking for. Anyone who tries to do something similar in proteomics usually ends up downloading spreadsheets with incomplete data from supplemental material of published articles.

It’s common to hear proteomics informaticians complain that experimentalists don’t make their data easily available, and wonder how they can be so greedy as to withhold data that was funded with public money. But jumping to conclusions about experimentalists being greedy or sloppy doesn’t seem fair to me. The real issue is far deeper than that.

First of all, sharing proteomics data requires the proper infrastructure. Lately there have been some projects aiming to provide it. I would say that the most popular proteomics repositories are PeptideAtlas, Proteinpedia, PRIDE and ProteomeCommons.

As far as I know, the main way Proteinpedia and PeptideAtlas handle data upload is with manual curators who study the format and the way the proteomics experiments were done. Then they come up with the best way to put the data in the repository. Usually it is a combination of custom parsing and some manual editing to make experiments consistent in the repository. Obviously this kind of method does not scale to every proteomics experiment out there. In order to cope, Proteinpedia focuses on human submissions only, whereas the PeptideAtlas team is already looking into other solutions to facilitate data submission.

PRIDE, on the other hand, accepts data submissions only in PRIDE XML format. Most of the proteomics tools experimentalists use cannot generate this XML format. However, the PRIDE team provides an application that tries to convert every format out there, and every experiment design, to PRIDE XML. But because they try to cover every single aspect of every experiment, I’ve seen experimentalists struggle when trying to fill in all the forms. Moreover, they also don’t like how rigid the model is for describing what the proteomics experiment was about. It’s true that there are some optional controlled vocabulary terms to give flexibility, but experimentalists still have difficulty wrapping their heads around how to use those terms.

This problem is not exclusive to PRIDE. Each proteomics lab uses its own mass spec terminology, and forms designed by developers with no first-hand experience running experiments frequently don’t use terms the experimentalists would understand. After all, experimentalists like to spend their time on experiments; they don’t like spending it on things they consider bureaucracy. The PRIDE team is aware of this problem and keeps trying more familiar ways for experimentalists to fill in form data.

ProteomeCommons seems to be the repository getting the most traction. It’s currently where most proteomics experimentalists submit their data to fend off journal editors’ complaints about the lack of published data. ProteomeCommons is built on top of the Tranche network, a kind of global distributed filesystem that potentially offers infinite scalability for storing data by just adding more nodes to the network. The Tranche network looks just like a big hard disk with several files, and everybody can upload anything they like. That’s where ProteomeCommons comes in. It’s the web gateway for uploading the data. The web application offers some options to annotate the data, but annotation is not as complete as what you have in the other databases. That’s understandable, because if they bothered the experimentalists with thousands of forms with mandatory fields, the experimentalists wouldn’t submit their data to ProteomeCommons. It’s also worth mentioning that you don’t need ProteomeCommons to use the Tranche network; any other proteomics repository could use Tranche as a backend to store huge proteomics files. The rest of the repositories are currently looking into using the Tranche network to store the huge amounts of proteomics data, especially that derived from raw data.

You might have noticed the conspicuous absence of format standards in my description of the different repositories. If everybody used the same proteomics standards, the infrastructural problems of sharing data would be solved. Right?

The major effort to standardize proteomics formats is being carried out by HUPO-PSI. It’s a kind of consortium that holds regular meetings where representatives of different proteomics groups around the world agree on what has to go in the standards and how. You can follow the discussions and chip in with what you would like to have in the formats.

Aside from the typical problems of something designed by committee, it remains to be seen whether mass spec vendors and proteomics software developers will fully embrace the standards. Proteomics data is highly heterogeneous by nature; there are very different kinds of proteomics experiments depending on the research being done. High quality proteomics experiments are not something that can be converted into an assembly line process where everything can be easily fixed and standardized.

However, the new stable format releases from HUPO-PSI look good enough to me to at least start making data exchange among repositories possible. There is also the promise from several mass spec vendors of a future commitment to fully support the standards. I hope all those promises don’t end up being just that.

In my opinion all these infrastructural difficulties are going to be solved somehow relatively soon. The problem is that I don’t think that just by solving the infrastructural problems everybody will start sharing data transparently. There are other difficulties.

The main reason experimentalists are, at least, uploading their data to Tranche is to avoid being bugged by proteomics journals into making their proteomics data available. Many proteomics journals are getting really serious about making the data available. Why are proteomics journals so interested in having authors make their data available?

One could argue that editors of these journals believe in the moral imperative of making the data available, but I don’t buy ethics as the main reason. The majority of biomedical scientific journals are still for-profit companies, not academic institutions. In order to survive they have to make money selling something, like every other company. For a journal, publishing articles that get lots of citations from other journals, which themselves have lots of citations, is the best way to guarantee that companies and institutions will keep renewing their yearly subscriptions.

It’s not something that I can demonstrate with facts, but lately I’m getting the feeling that proteomics is being disregarded by people in other biological fields as low quality research. After all, proteomics is just a technology, a tool for finding biological insights. Mass spec research by itself wouldn’t get so much funding if it couldn’t be used for biological research. What I think is happening is that biologists are taking biological findings in pure proteomics journals less seriously. Most published proteomics experiments are irreproducible, and if you start digging into the published data you frequently find many false positives. That’s why proteomics journal editors are pushing experimentalists to release their data and make it as transparent as possible. They hope they can gain more credibility and get more citations from non-proteomics journals.

But one can still see that most experimentalists are reluctant to make their best data fully available. Many informaticians trying to analyze the data think that this attitude comes from cultural resistance to change. Many of these informaticians try to evangelize the experimentalists about why it is so important to share data. Among evangelists, the most notable group is the Fix Proteomics Campaign, which proposes some habits to make proteomics more credible.

But experimentalists are not dummies. I have seen them change any habit really quickly if they find something better. The problem is that making their data transparently available is worse for them, and this is where I think the campaigns miss the point. Let me explain something unique about mass spectrometry proteomics that seems easily forgotten by many people.

Mass spectrometers are really expensive instruments. Getting the skills to operate them takes several years of training. To make things more costly, these instruments become obsolete in a matter of a few years because new ones with better features are constantly coming out. When a new instrument arrives at the lab, a lot of time is spent optimizing it and learning how to troubleshoot it. If you don’t keep getting those new mass spectrometers, you are left behind by competitors who can take advantage of more powerful instruments.

How is it possible in academia to maintain a high funding inflow? In a company you have to sell a service or a product, but you can’t do so in an academic group. Usually you rely on granting agencies to provide funding. Granting agencies award money according to the scientific productivity of the group. Publications in reputable journals are the main tangible measure granting agencies use for scientific productivity.

But most journals, as I said before, have to operate like companies. Proteomics data by itself is not publishable. If there is no story with some biological insight or some novel way to improve results, what will you write in a proteomics paper with just high quality data? Generating high quality proteomics data is damn difficult, I would say even more difficult than coming up with a fancy analysis of the data. Let me explain.

People coming from the genomics and transcriptomics fields sometimes forget that the chemical nature of proteins is much more diverse than that of DNA or RNA. After all, DNA and RNA have more or less homogeneous chemical properties regardless of their sequence. Proteins, on the other hand, are chemically completely different from each other depending on the sequence - that’s why they can carry out so many molecular functions. Proteins from the nucleus are completely different from proteins from the cytoplasmic membrane.

The proteome is also much more dynamic than the genome. The same cell under different conditions shows a completely different proteome profile. You also have to take chemical modifications of proteins into account, which are only detectable by probing the proteins directly. Chemical modifications like phosphorylation act as functional switches for proteins, and a protein with a modification also has different chemical properties than the same protein without it. The heterogeneity of proteins makes protein purification, proper separation, and identification by mass spec an entire field by itself.

So an experimentalist who has had years of training just to be able to identify proteins and chemical modifications might, understandably, lack the skills for the sophisticated analysis that will make the story sexier for proteomics journals. Software that lets a power user analyze proteomics data without programming knowledge is still in its early days. As a developer I can see how difficult it is to make analysis software that covers every kind of proteomics experiment with a point and click interface.

The most logical step for an experimentalist with good data would be to look for people specialized in analysis to come up with a powerful story. Usually the best analysts are independent proteomics informaticians who will do the analysis only if they get the credit for it. After all, they also have to get funded to keep doing research, and they can always claim they are the ones writing the paper. But even if proteomics informaticians give proper credit to the data generators in their papers - which I wouldn’t say is always the case -, very few granting agencies will keep an experimental lab funded just to generate data that other people will use to write publications.

To tackle this problem, the most powerful experimental proteomics labs are trying to aggressively hire programmers who can do analysis in-house so that the credit remains within the group. The first problem these groups face is that there are almost no proteomics informaticians on the job market. They have to invest in people with programming skills who will eventually acquire the proteomics knowledge necessary to write useful analysis programs or to analyze the data themselves.

Also, because experimentalists lack programming knowledge, it can be tricky for them to envision which potential programmers have the skills required to become good analysts, not to mention how to give those programmers a compelling reason to join their field. There are also programmers, and I include myself, who actively look for precisely these highly experimental labs instead of pure informatics groups in order to get a closer understanding of experimental data by interacting directly with the experimentalists.

But this kind of setting, where data generators and informaticians try to work together, is prone to Dilbert-like situations. I would say that it’s mainly because of the misunderstandings generated by the technological gap.

I feel fortunate to work in my current group because I think it’s one of the few experimental proteomics labs where people are aware of this problem and actively try to improve communication with the informaticians.

The final point I want to make in this post is that if the data generators were rewarded properly for what they are good at, generating unbeatable high quality data, then sharing proteomics data transparently, and as soon as it’s generated, would become mainstream. I see the way granting agencies hand out rewards as the main cause for not sharing data, because they usually don’t reward the generation of data accordingly. They should also reward the labs that not only make the data available but make it as accessible as possible for data analysts, so that the proteomics research field would advance much faster than it currently does. I acknowledge granting agencies are slowly changing for the better, but there is still a long way to go.

But how granting agencies work and what their motivations are is still quite fuzzy to me. I think I’m still not old enough to understand the politics behind research funding.