Data management – from a section in the grant proposal to a day-to-day reference manual

By Krzysztof Cipora, Lecturer in Mathematical Cognition, Open Research Lead of the School of Science, Centre for Mathematical Cognition, Loughborough University, k.cipora@lboro.ac.uk, @krzysztofcipora

Most funding agencies require grant proposals to contain a data management plan. Preparing yet another document may seem like an extra burden: all applicants have handled research data before and know how to do it, so why mandate such a technicality in the proposal? Yet many researchers have never been formally trained in data management. Open Research practices, which are increasingly adopted (and increasingly mandated by funders), include Open Data, that is, sharing research data either publicly or by granting access to other researchers. Whether shared publicly or with some restrictions, the data need to be understandable and usable. This requires the data to be thoroughly curated and documented and, ideally, accompanied by the programming code used for their processing and analysis. Researchers also benefit from good data curation and documentation when they come back to their own data after a few months or years. Data management is even more critical in large-scale projects involving many researchers, research assistants, etc. However, the document typically supplied to the funder is relatively short (space restrictions!) and therefore quite generic, so it cannot fully satisfy day-to-day data management needs.

In June 2022 at Loughborough University, we launched the £9,989,000 Centre for Early Mathematics Learning (CEML; ceml.ac.uk), funded within the ESRC Research Centre scheme. The funding covers a period of five years, and over twenty-five researchers from several institutions are involved within five CEML challenges. They are supported by several research assistants and PhD students. Various types of data are being gathered. One of the crucial issues at the onset of CEML was to ensure that we were all on the same page regarding data management. This was necessary for several reasons, both within CEML and for future data sharing. The same variables should be named and coded consistently across studies to make the data easier to read, facilitate the re-use of analysis code, and allow datasets to be combined if needed. If a researcher is reallocated from one study to another, they can catch up easily. Keeping our data curated and consistent for ourselves also makes them more accessible to other researchers when we share them with the community.
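
To illustrate why consistent naming matters when datasets are combined, here is a minimal Python sketch. It assumes each study is held as a list of records with identical variable names. The variable names, study labels, and the combine_studies helper are hypothetical and are not taken from the CEML documents.

```python
from typing import Dict, List

Row = Dict[str, object]


def combine_studies(studies: Dict[str, List[Row]]) -> List[Row]:
    """Stack rows from several studies, tagging each row with its study ID.

    Raises ValueError if any row uses a different set of variable names
    from the first row seen, since the rows could then not be stacked safely.
    """
    expected = None
    combined: List[Row] = []
    for study_id, rows in studies.items():
        for row in rows:
            names = set(row)
            if expected is None:
                expected = names
            elif names != expected:
                raise ValueError(
                    f"{study_id}: variables {sorted(names)} "
                    f"differ from {sorted(expected)}"
                )
            combined.append({"study_id": study_id, **row})
    return combined


# Hypothetical example data: two studies using the same variable names.
study_a = [{"participant_id": "A001", "age_months": 52, "counting_score": 7}]
study_b = [{"participant_id": "B001", "age_months": 49, "counting_score": 9}]

combined = combine_studies({"study_a": study_a, "study_b": study_b})
```

If one study had called the same variable AgeMonths rather than age_months, the function would stop with an error instead of silently producing a dataset with mismatched columns.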

Together with colleagues, I took on the task of preparing a detailed Data Management Policy for CEML. Below, I briefly describe what we did and how it might serve as an example for other projects (including much smaller ones).

We started from the Data Management Policy in the CEML proposal and elaborated on the details (see CEML Data Management Policy https://doi.org/10.17028/rd.lboro.21820752). The document first outlines the responsibilities of Challenge Leads and of the Leads of the specific studies being run (this is particularly important given the number of researchers involved in CEML). It specifies where the data are stored, who should have access to the data (working together with researchers from outside Loughborough University required some thinking about how to set this up efficiently), and when the data can be shared publicly. The document also specifies how to document the data entry process and how to document data analysis to ensure analytical reproducibility. Finally, it specifies the processes for creating backups and sharing data.

The Data Management Policy refers to the Variable Dictionary (see CEML Variable Dictionary https://doi.org/10.17028/rd.lboro.21820824) – a document providing detailed information on how to name and organise data files and how to name variables. We also provide a template for the metadata file to be created for each study (see CEML Variable Dictionary Template https://doi.org/10.17028/rd.lboro.21820947).
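
A variable dictionary is most useful when a dataset can be checked against it automatically. The Python sketch below shows one way this might look. The dictionary entries, the lowercase snake_case naming rule, and the check_variables helper are all assumptions made for illustration; the actual CEML conventions are those in the linked documents.

```python
import re

# Hypothetical variable dictionary: variable name -> short description.
VARIABLE_DICTIONARY = {
    "participant_id": "Anonymised participant code",
    "age_months": "Age at testing, in months",
    "counting_score": "Number of correct counting trials",
}

# Assumed naming rule for this sketch: lowercase words joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_variables(column_names, dictionary=VARIABLE_DICTIONARY):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    for name in column_names:
        if not SNAKE_CASE.match(name):
            problems.append(f"{name}: not lowercase snake_case")
        if name not in dictionary:
            problems.append(f"{name}: not in the variable dictionary")
    for name in dictionary:
        if name not in column_names:
            problems.append(
                f"{name}: listed in the dictionary but missing from the data"
            )
    return problems


# A data file whose header follows the dictionary, and one that does not.
ok_report = check_variables(["participant_id", "age_months", "counting_score"])
bad_report = check_variables(["participant_id", "AgeMonths", "counting_score"])
```

Running such a check each time a data file is created or updated catches naming drift early, when it is still cheap to fix.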

To make these materials more accessible to CEML colleagues, we prepared a short video highlighting the most important aspects of CEML data management and explaining why such detailed guidelines were prepared (see CEML Data Management Training Video https://doi.org/10.17028/rd.lboro.21820713).

All of this may look quite elaborate and may not seem very useful for smaller projects. However, at least some of these points may be worth considering. It is worth remembering that “there will be at least two people working with your data, you and future you”. Thus, ensuring a consistent way of naming data files, their structure, variable names, and the analysis code, and documenting the progress of data processing, is a big favour to future you and, most likely, to other researchers who may work with your data. Hopefully, the linked CEML documents can serve as a useful template for preparing a day-to-day data management reference for other projects.

The views and opinions of this article are the author’s and do not reflect those of the University…although hopefully they do reflect Loughborough University values.