Science - January 14, 2016

Data is ‘everywhere and nowhere’

Roelof Kleis

Researchers are supposed to store their data so that it remains easily accessible over a long period. But how and where exactly is up to them. That lack of central coordination has led to a wide range of solutions — with the inevitable ensuing risks.

The new Hyperion data centre, behind the Actio building, was opened in May 2014. It is possible to store data centrally there, but many chair groups think it is expensive.

For more than a year now, chair groups and individual PhD candidates have been required to draw up a data management plan. Such a plan gives a detailed description of how research data will be stored and archived. Furthermore, Wageningen UR’s new code of conduct for scientists stipulates that research data should now be kept for ten years rather than five. How do scientists do that? Where is all that data being stored?

You might think it is being kept in a central location, on the servers of the brand-new Hyperion data centre on campus. But that is not the case. Raoul Vernède, IT security manager at Facilities and Services, estimates that about half the chair groups store their data at Hyperion. ‘And if we look at the number of terabytes, we get a much lower proportion than that. The rest is everywhere and nowhere.’ That means on separate hard disks or servers purchased by the groups themselves, on external servers, in journals’ digital repositories or in the cloud. Vernède is not quite prepared to call the situation a nightmare. ‘But it is certainly undesirable from a security perspective. Our future depends on our reputation, so we mustn't throw it away by running the risk of losing important data or becoming the victims of fraud.’


No central control

‘There is no central control for data management,’ says Phytopathology professor Bart Thomma emphatically. ‘Everyone does their own thing. That is partly because of the huge diversity in the data produced by the different chair groups, and consequently in the criteria that the storage method has to satisfy. Our data mostly takes the form of files of genome sequences and genome expressions. They usually go in their entirety to a gene bank or the National Center for Biotechnology Information in the US. I'm not worried about those large datasets as they are incorporated in articles, which ensures their safe storage.’ Thomma has his own servers in the lab for digital data that is not used in articles. ‘The rest of the data is produced through lab work and is mainly stored in paper lab records.’

The Bioinformatics group uses digital lab journals, says Professor Dick de Ridder. ‘In our field, the focus is on the method used to achieve the results. We use Evernote for our digital lab journals. After a research project has finished, the data, software and lab journals are stored in a directory. We use Gitlab for the storage of the software, a service provided by the Forum library’s Data Management Support Hub.’ According to De Ridder, data management is actually determined to a large extent by the requirements of the scientific journals. ‘In molecular biology, it has been the case for years that the raw data used in a publication must be made available. That data is generally stored in data banks stipulated by the journals.’ De Ridder stores the data on his own servers, which are kept at Hyperion. ‘So we rent space internally but we manage the storage ourselves. IT does offer a storage service but it is far too expensive.’


‘Far too expensive’

IT service manager Stephen Janssen is familiar with the complaints about the high costs of his service. ‘We do everything we can to reduce the rates, but our charges have to cover all the costs. We halved our prices last October so that we could compete with cheap data storage servers such as NAS. These are servers that you can buy cheaply online or via Media Markt and connect up to the network.’

A terabyte of storage now costs 150 euros a year from the IT department. But Janssen says this does get you a professional service and secure, reliable, convenient data storage. The rates might have been halved, but it is still much too expensive for De Ridder and Thomma. ‘That would still cost us 28,000 euros a year including backups. We can't afford that,’ says De Ridder. His colleague Thomma even calls the rates ‘ridiculous’.

Janssen sees that chair groups are starting to act like mini IT departments because of the costs. ‘And that worries me. Some groups are doing quite a good job actually, but it's far from ideal and is asking for trouble.’

Inge Grimm, director of operations at SSG and Wageningen UR information manager as of this month, stresses the importance of central storage. But she admits that there are no firm agreements about this with the chair groups. ‘We don't force them to do this and there are no sanctions. There is central coordination in the sense of persuasion. Good, affordable internal storage options are needed. It is also important to raise awareness. Scientists need to be more aware than in the past of the dangers associated with the external storage of data.’

But it is precisely that awareness that is lacking, thinks service manager Janssen. ‘Those data management plans were introduced in 2014. We thought that this would create a substantial demand for our services. But unfortunately not. Last year, we were only approached for advice 12 times. It is difficult for us to get a foot in the door in the chair groups.’

Bioinformatics professor De Ridder thinks more central coordination might help. ‘The individual chair groups were put in charge of implementing the requirement to produce data plans and everyone had to figure things out for themselves. It would probably have been better to make arrangements at the level of the university. Gitlab, the platform for storing computer programs in a central location, is a good example of that.’

Phytopathologist Thomma has doubts about the benefit of more central direction. ‘To a large extent, the debate about data management was prompted by fraud affairs such as that of the psychologist Diederik Stapel. The idea is that you can help prevent fraud by managing data properly. But I don't buy that. I think that the argument that good data management will let us make more use of each other's data is much more relevant. A lot of data is lost at the moment because the data is so difficult to access. I would give the chair groups the responsibility for the coordination.’

Security manager Vernède, on the other hand, is in favour of more central control over data management. ‘You could, for example, set up a fund to enable proper centralized storage. I think that all the parties involved should get together and think about what risks we are willing to run and then draw up a recommendation for the Executive Board. There should be more central control over where the data ends up. At present nobody checks whether and how the data plans are being implemented.’