Cristina Ferrero Castaño

What is the health data lake that Spain is calling for?

The Recovery Plan envisages the creation of a health data lake, a gigantic digital infrastructure to store data such as medical records for the entire country.


Alberto R. Aguiar

Business Insider

September 7, 2021


The Government wants to create a health data lake. A data lake is a big data concept: a huge repository of raw data, in this case health data, that can be used to run operations, carry out research, make predictions or detect trends. Spain wants to have the capacity to carry out "massive analysis in real time" of the health of its citizens.


The proposal is included in component 18 of the Recovery, Transformation and Resilience Plan, the document that articulates the arrival of European funds in the country. The initiative is driven by both the Ministry of Health and the Secretary of State for Digitalisation and Economic Affairs, and is valued at €100 million.


With this data lake, specialists and researchers would be able to perform real-time analysis to improve diagnoses and treatments, analyse trends, identify patterns and even prevent health risk situations, or at least anticipate more accurately situations as critical as a pandemic.


Creating a huge digital infrastructure entails a number of challenges. Health powers are devolved to the autonomous regions, so coordinating regions, hospitals and health centres is a challenge in itself. The government has not yet launched the call for tenders to set up this data lake, but it already has suitors.


Among the scientific societies and associations supporting the initiative are the Spanish societies of Medical Oncology, Internal Medicine, Digestive Pathology, Primary Care Physicians, Endocrinology and Nutrition, and Intensive Care Medicine, as well as the Platform of Patient Organisations and the Spanish Association Against Cancer.


How a health data lake works


Having such a data lake, with a huge catalogue of raw health data, would speed up health sciences research enormously. In the past, a researcher could go to a hospital to consult, either physically or electronically, the medical records of a thousand patients. With the data lake, he or she will be able to access the data of every patient in the country.


This is possible because machine learning and natural language processing, two types of AI, are involved in normalising and structuring the vast amount of data that clinicians dump into their patients' medical record systems.




If a traumatologist in Galicia records that a patient suffers from gonalgia, another in Murcia notes that a patient has "pain in the knees", and a third in Andalusia writes that a patient complains of "pain in one knee", the AI will be able to reduce these three cases to the same data point, knee pain, and identify it with a specific code.
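The normalisation step described above can be illustrated with a deliberately simple sketch. It uses a fixed lookup table rather than the machine learning and natural language processing the article refers to, so it only shows the goal (different wordings mapped to one code), not the actual technique. The table `SYNONYMS`, the helper names and the code `KNEE_PAIN` are all hypothetical placeholders and do not come from any real clinical terminology.

```python
import unicodedata

# Hypothetical synonym table: the phrases and the code "KNEE_PAIN" are
# illustrative placeholders, not entries from a real clinical terminology.
SYNONYMS = {
    "gonalgia": "KNEE_PAIN",
    "knee pain": "KNEE_PAIN",
    "pain in the knees": "KNEE_PAIN",
    "pain in one knee": "KNEE_PAIN",
}


def normalise(text: str) -> str:
    """Lower-case the text, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def to_code(note: str):
    """Return the code for a clinical phrase, or None if the phrase is unknown."""
    return SYNONYMS.get(normalise(note))


# Three differently worded notes that should end up with the same code.
notes = ["Gonalgia", "knee pain", "Pain in one  knee"]
codes = [to_code(note) for note in notes]
print(codes)
```

A real system would need to handle spelling variants, abbreviations and context (for example negations such as "no knee pain"), which is where statistical language models come in.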


Another advantage is that all raw information is available at a glance. A patient can generate information after a visit to the emergency room, or to the operating theatre, or to a specialist. With a data lake you could see all the information that has been generated.


Privacy safeguards


Of course, what the government is proposing is to deposit exceptionally sensitive health data in a large data lake. This requires a host of safeguards. The data must be thoroughly anonymised: before the data enters its systems, the company handling it must not know who the data belongs to or whose medical records it is dealing with.


In fact, the proposal is to generate two huge databases in the data lake. The first will be pseudonymised so that it remains useful for management: with it, the custodians (the hospitals, the managers) will be able to identify patients in order to guarantee their healthcare. The second will be completely anonymous and will be accessible to researchers from different organisations.
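The difference between the two databases can be sketched in a few lines. This is only an illustration under stated assumptions: the article does not say how pseudonyms would be generated, so the keyed hash (HMAC-SHA256) used here, the key `CUSTODIAN_KEY`, the field names and the sample record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the custodian (for example, a hospital).
CUSTODIAN_KEY = b"example-key-not-for-production"


def pseudonym(patient_id: str, key: bytes = CUSTODIAN_KEY) -> str:
    """Derive a stable pseudonym from an identifier with keyed HMAC-SHA256."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymise(record: dict) -> dict:
    """Replace the direct identifier with a pseudonym; clinical fields stay."""
    out = {k: v for k, v in record.items() if k != "patient_id"}
    out["pseudonym"] = pseudonym(record["patient_id"])
    return out


def anonymise(record: dict) -> dict:
    """Drop the identifier and the pseudonym for the research copy."""
    return {
        k: v for k, v in record.items() if k not in ("patient_id", "pseudonym")
    }


# Made-up sample record, for illustration only.
record = {"patient_id": "ID-0001", "region": "Galicia", "finding": "knee pain"}
managed = pseudonymise(record)   # what the custodians would work with
research = anonymise(managed)    # what researchers would see
print(managed)
print(research)
```

Only someone holding the key can recompute the pseudonym for a known patient and so link records back to that person, which is what lets custodians guarantee care. Note that simply dropping identifiers, as `anonymise` does here, is not sufficient for real anonymisation, because combinations of the remaining fields can still single out individuals.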


For example, with the predictive power of big data, a nationwide data lake could have enabled earlier detection of coronavirus cases by flagging the anomalous increase in pneumonia cases in early 2020.
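As a rough illustration of what detecting such an anomaly could look like, the sketch below flags weeks whose case count lies far above the average of the preceding weeks. The method (a rolling z-score), the function name `flag_anomalies`, its parameters and the weekly figures are all assumptions made for this example; the numbers are invented and are not real surveillance data.

```python
from statistics import mean, stdev


def flag_anomalies(counts, baseline_weeks=8, threshold=3.0):
    """Return the indices of weeks whose count exceeds the mean of the
    preceding baseline window by more than `threshold` standard deviations."""
    flagged = []
    for i in range(baseline_weeks, len(counts)):
        window = counts[i - baseline_weeks:i]
        mu, sigma = mean(window), stdev(window)
        # Skip perfectly flat windows, where the z-score is undefined.
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Invented weekly pneumonia counts: stable at first, then a sharp rise.
weekly_cases = [100, 104, 98, 101, 97, 103, 99, 102, 100, 140, 180]
flagged = flag_anomalies(weekly_cases)
print(flagged)
```

In practice a surveillance system would also have to account for seasonality and reporting delays, which this sketch ignores.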


Political commitment to pioneering


The European Union was already working towards a European data governance framework, which would help to create frameworks in which data (including health data) would be interoperable across countries and systems.


Spain has the opportunity to be a pioneer thanks to the application of AI and big data in conjunction with the digitisation of hospitals and the widespread use of electronic health records.


Another issue raised by this disruption has to do with Spain's technological sovereignty. The enormous amount of data that a data lake would hold means that conventional infrastructure, such as servers, cannot be relied upon.


Relying on the cloud for both the storage and processing of this data is therefore not optional. The question is which provider to trust. The big cloud players are foreign companies. For this very reason, the Ministry of Economic Affairs itself announced in June the creation of a hub within GAIA-X.


GAIA-X is an initiative, at first Franco-German and now European, to create a federated European cloud, "with the values" of the Old Continent as its flag, and it could be the answer to these questions. In fact, back in June, the Ministry announced that Spain would develop a specific hub in GAIA-X for tourism data and another for health data.














