Data Management and Data Engineering

Research track

The research tracks and thematic axes of Health described above assume the gathering, integration and combination of multiple heterogeneous data sources, public and private, multimodal, collected and stored over long periods of time, at both the individual level (everything about the individual) and the population level. In addition, methods based on machine learning (predictive and descriptive models, simulations) depend directly on the availability of large amounts of diverse, complementary and high-quality data for training and validation. In this context, the great challenge of this research line is: how can we continuously collect, and effectively and efficiently integrate, hundreds of heterogeneous, multimodal, historical data sources while ensuring the quality and provenance of the integrated data, so that the data integration and management processes are transparent, replicable, verifiable, auditable and explainable?

Regarding the acquisition of medical data, we can consider two situations: (i) the use of pre-existing historical data obtained, for example, from the SUS information systems or from surveys of private health providers such as Unimed; and (ii) the acquisition and gathering of new data from exams and/or wearable devices. In both cases, the information needs to be aggregated and integrated so as to guarantee the minimum quality requirements listed above. For the acquisition of new medical data, the challenge is to provide gathering mechanisms that, from the initial stage of the development of health systems, yield data with minimum levels of quality at low cost. Medical data is inherently noisy, comes from multiple heterogeneous sources (wearables, environmental sensors, echocardiograms, tomographies, X-ray images, photos of tissues and organs, social media), is limited in quantity (for instance, it is not desirable to subject a patient to countless X-ray examinations or tomographies) and must be collected in a way that protects the patient’s privacy [70].

The data gathering process may apply, for example, mechanisms for filtering and merging patient data, aiming at a more “intelligent” acquisition of past health records. Filtering techniques should deal with noise and with the different acquisition frequencies of this data. The integration of data from multiple sources, in turn, aims at increasing the accuracy of forecasts and the robustness of the results without increasing the cost of obtaining the desired data [71]. The cost dimension is very relevant in the gathering of medical data since, in many cases, it is not possible to rely on high-resolution devices, or even on specialists, to obtain it [114]. Data fusion exploits the variety of patient data modalities: if the patient’s echocardiogram is noisy or contains acquisition errors, detailed medical records can complement the patient’s past health information. Thus, filtering methods for treating different types of noise, combined with the fusion of patient data, can help provide the richness of data that medical applications require.
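As an illustration of the filtering and fusion ideas above, the following minimal sketch denoises one signal, aligns two signals acquired at different frequencies onto a common time grid, and fuses them. It is not the project's method: the median filter, linear interpolation and inverse-variance weighting are simple stand-ins chosen for clarity, and the readings, variances and function names are all hypothetical.

```python
from statistics import median

def median_filter(values, window=3):
    """Suppress isolated spikes by taking the median over a sliding window."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def resample(samples, grid):
    """Linearly interpolate (time, value) samples onto a common time grid.

    Assumes samples are sorted by strictly increasing time.
    """
    out = []
    for t in grid:
        if t <= samples[0][0]:        # before the first sample: hold first value
            out.append(samples[0][1])
        elif t >= samples[-1][0]:     # after the last sample: hold last value
            out.append(samples[-1][1])
        else:
            for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
                if t0 <= t <= t1:
                    out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                    break
    return out

def fuse(a, b, var_a, var_b):
    """Inverse-variance weighted average of two aligned series."""
    wa, wb = 1.0 / var_a, 1.0 / var_b
    return [(wa * x + wb * y) / (wa + wb) for x, y in zip(a, b)]

# Hypothetical heart-rate readings as (seconds, beats per minute).
wearable = [(0, 70), (1, 71), (2, 180), (3, 72), (4, 73)]   # 1 Hz, spike at t=2
monitor = [(0, 72), (2, 72), (4, 74)]                        # 0.5 Hz, cleaner

grid = [0, 1, 2, 3, 4]
times, values = zip(*wearable)
wearable_clean = list(zip(times, median_filter(list(values))))
fused = fuse(resample(wearable_clean, grid), resample(monitor, grid),
             var_a=4.0, var_b=1.0)  # assumed noise variances
```

In this sketch the noisier source receives a smaller weight, so the fused series follows the cleaner monitor while still using the wearable's higher sampling rate between monitor readings.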

In terms of data integration, a central problem is pairing, which recognizes multiple disparate representations of the same entity. This problem has historically been addressed by recognizing similarities between attribute values (e.g., affiliation, gender), in ways that must account for missing or incorrect data. These solutions typically use efficient probabilistic methods, but they are difficult to train, validate and explain.
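To make the probabilistic approach concrete, the sketch below scores a pair of records in the style of classical (Fellegi-Sunter) record linkage: each field contributes a log-likelihood ratio when it agrees or disagrees, and a missing value contributes nothing. The field names, the m/u probabilities and the 0.85 string-similarity threshold are hypothetical values chosen for illustration; in practice they would have to be estimated from data, which is part of what makes such methods hard to train and validate.

```python
import math
from difflib import SequenceMatcher

# Hypothetical parameters per field:
#   m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "mother_name": (0.90, 0.005),
    "sex": (0.98, 0.50),
    "birth_date": (0.97, 0.003),
}

def agrees(field, a, b):
    """Return True/False for agreement, or None when either value is missing."""
    if not a or not b:
        return None
    if field in ("name", "mother_name"):
        # Tolerate small spelling differences in free-text fields.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.85
    return a == b

def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over fields; missing fields add 0."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        outcome = agrees(field, rec_a.get(field), rec_b.get(field))
        if outcome is None:
            continue
        total += math.log2(m / u) if outcome else math.log2((1 - m) / (1 - u))
    return total

# Hypothetical records from two sources.
a = {"name": "Maria de Souza", "mother_name": "Ana de Souza",
     "sex": "F", "birth_date": "1980-05-01"}
b = {"name": "Maria de Sousa", "mother_name": None,
     "sex": "F", "birth_date": "1980-05-01"}
c = {"name": "João Pereira", "mother_name": "Ana de Souza",
     "sex": "M", "birth_date": "1975-01-12"}

likely_match = match_weight(a, b)      # positive: evidence for the same person
likely_non_match = match_weight(a, c)  # negative: evidence against
```

A positive total weight counts as evidence that the two records describe the same person, and a negative one as evidence against. Where to place the decision thresholds is a separate choice that this sketch leaves open.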

In summary, to address the research questions of data gathering and integration and achieve the listed objectives, we propose: (1) to develop techniques for creating representative data collections, including filtering methods combined with the fusion of patient data, that aggregate rich, discriminative data for use by medical applications; methods such as one-shot learning and/or self-supervision [115] will be explored from the very beginning of data processing to obtain data that satisfies minimum quality requirements; (2) to develop and build integrated data infrastructures based on Data Lakes [116], continuously fed by various data sources with heterogeneous and multimodal data (structured, unstructured, textual, multimedia, time series, etc.), whose integration is based on data pairing with advanced artificial intelligence methods; (3) to create and evaluate new data matching and integration solutions based on state-of-the-art methods and techniques, including those based on machine learning. These techniques should allow the automated learning of advanced ways of combining similarity functions over semantically richer representations (e.g., embeddings [117], distance-based meta-attributes [118]) and should consider the medical context of the intended integration and analysis.
From a technical point of view, this implies: (i) defining the most appropriate algorithms and techniques for each purpose (e.g., through self-learning [96]); (ii) making these algorithms and techniques efficient (e.g., through meta-blocking [93]); (iii) enabling them to produce data with the quality required by each analysis (e.g., through data fusion); (iv) allowing the (semi-)automatic creation of labeled data for learning the data integration task (e.g., through active learning [119] or co-training [120]); and (v) supporting the verification and explanation of the results (e.g., through learning with hybrid models), which lends reliability to their use in medical applications. Finally, we propose (4) to develop advanced provenance mechanisms [121] that make the integration process transparent, replicable, verifiable and auditable. The diverse applications of medical analysis and public health require guarantees of case representativeness, data quality, source reliability and process transparency, so that ethical and technical issues can be addressed. These requirements apply to integration in general, and to pairing in particular, adding unparalleled complexity to this challenge.
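Two of the ideas above, reducing the number of comparisons and recording provenance, can be sketched together. The example below uses plain token blocking (a much simpler technique than the meta-blocking cited above, used here only as a stand-in) and wraps it in a hypothetical `run_step` helper that logs the parameters and content hashes of each step's input and output, so that a run can later be replayed and checked. All names and records are illustrative assumptions.

```python
import hashlib
import json
from collections import defaultdict
from itertools import combinations

def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def token_blocking(records, field):
    """Return candidate pairs: only records sharing a token are compared."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for token in (rec.get(field) or "").lower().split():
            blocks[token].add(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)

def run_step(name, func, inputs, params, log):
    """Run one pipeline step and append a provenance entry describing it."""
    output = func(inputs, **params)
    log.append({
        "step": name,
        "params": params,
        "input_sha256": fingerprint(inputs),
        "output_sha256": fingerprint(output),
    })
    return output

# Hypothetical records keyed by source-qualified identifiers.
records = {
    "sus-1": {"name": "Maria de Souza"},
    "sus-2": {"name": "Joao Pereira"},
    "lab-9": {"name": "Maria Souza"},
}

log = []
pairs = run_step("token_blocking", token_blocking, records, {"field": "name"}, log)

# Re-running the same step on the same input should reproduce the same hashes.
replay_log = []
run_step("token_blocking", token_blocking, records, {"field": "name"}, replay_log)
```

Here three records would require three pairwise comparisons without blocking, while blocking keeps only the single pair that shares a name token. Comparing `log` with `replay_log` is one simple way to check that a step is replicable; a real provenance mechanism would also need to record code versions, data source identifiers and timestamps.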

Principal Investigators: Altigran Silva, Marcos Gonçalves, Carisi Polanczyk, Marco Romano-Silva


70. White S. A review of big data in health care: challenges and opportunities. OAB. 2014 Oct;13.

71. Perez-Rua J-M, Vielzeuf V, Pateux S, Baccouche M, Jurie F. MFAS: Multimodal Fusion Architecture Search. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

93. dal Bianco G, Gonçalves MA, Duarte D. BLOSS: Effective meta-blocking with almost no effort. Inf Syst. 2018 Jun 1;75:75–89.

96. Cunha W, Canuto S, Viegas F, Salles T, Gomes C, Mangaravite V, et al. Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Inf Process Manag. 2020;57:102263.

114. Nascimento BR, Martins JFBS, Nascimento ER, Pappa GL, Sable CA, Beaton AZ, et al. Deep learning for automatic identification of rheumatic heart disease in echocardiographic screening images: data from the ATMOSPHERE-PROVAR study. J Am Coll Cardiol. 2020 Mar 24;75(11, Supplement 1):3577.

115. Fernando B, Bilen H, Gavves E, Gould S. Self-Supervised Video Representation Learning with Odd-One-Out Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

116. Olston C, Korn F, Noy N, Polyzotis N, Whang S, Roy S. Managing Google’s data lake: an overview of the Goods system. 2016.

117. Cappuzzo R, Papotti P, Thirumuruganathan S. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: Association for Computing Machinery; 2020. p. 1335–49. (SIGMOD ’20).

118. Canuto S, Salles T, Rosa TC. Similarity-Based Synthetic Document Representations for Meta-Feature Generation in Text Classification. Proceedings of the 42nd [Internet]. 2019.

119. Cardoso TNC, Silva RM, Canuto S, Moro MM, Gonçalves MA. Ranked batch-mode active learning. Inf Sci. 2017 Feb 10;379:313–37.

120. Magalhães LFG, Gonçalves MA, Canuto SD, Dalip DH, Cristo M, Calado P. Quality assessment of collaboratively-created web content with no manual intervention based on soft multi-view generation. Expert Syst Appl. 2019 Oct 15;132:226–38.

121. Freire J, Chirigati F. Provenance and the different flavors of computational reproducibility. IEEE Data Engineering Bulletin. 2018;41(1):15.