Canadian Journal of Cardiology
Review | Volume 38, Issue 2, P169-184, February 2022


A Primer on the Present State and Future Prospects for Machine Learning and Artificial Intelligence Applications in Cardiology

  • Cedric Manlhiot (corresponding author: Johns Hopkins Hospital, 600 N Wolfe Street, 1389 Blalock, Baltimore, Maryland 21287, USA), Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
  • Jef van den Eynde, Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA; Department of Cardiovascular Sciences, KU Leuven, Leuven, Belgium
  • Shelby Kutty, Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
  • Heather J. Ross, Ted Rogers Centre for Heart Research, Peter Munk Cardiac Centre, University Health Network, Department of Medicine, University of Toronto, Toronto, Ontario, Canada
Published: November 24, 2021. DOI: https://doi.org/10.1016/j.cjca.2021.11.009

      Abstract

The artificial intelligence (AI) revolution is well underway, including in the medical field, and has dramatically transformed our lives. An understanding of the basics of AI applications, their development, and the challenges to their clinical implementation is important for clinicians to fully appreciate the possibilities of AI. Such a foundation would ensure that clinicians have a good grasp of, and realistic expectations for, AI in medicine and prevent discrepancies between the promised and real-world impact. When quantifying the track record of AI applications in cardiology, we found that a substantial number of AI systems are never deployed in clinical practice, although there certainly are many success stories. Successful implementations shared the following characteristics: they came from clinical areas where large amounts of training data were available; they were deployable within a single diagnostic modality; their prediction models generally performed well in external validation; and most were developed as part of collaborations with medical device manufacturers who had substantial experience with implementation of new clinical technology. When looking into the current processes used for developing AI-based systems, we suggest that expanding the analytic framework to address potential deployment and implementation issues at project outset will improve the rate of successful implementation and will be a necessary next step for AI to achieve its full potential in cardiovascular medicine.


      Artificial Intelligence in Medicine

The artificial intelligence (AI) revolution is well underway and has already dramatically transformed our lives in countless respects. Nowadays, most people interact with AI systems in routine daily activities, whether they realise it or not, and the fundamental impact that AI has on our world is undeniable. AI is also of great interest to medicine, holding considerable promise, with an unprecedented amount of resources devoted to the development of AI-based technologies in a variety of medical specialties, including cardiology. In many respects, the development and deployment of AI systems in medicine are still in the early stages compared with many other industries, which started investing heavily in these technologies a few decades ago, first with expert systems and later with progressively more sophisticated algorithms as methods and computational resources improved. As such, comparisons between the status of AI in medicine and in other industries highlight substantial gaps, both in the capacity of the technology and in the number of successful implementations.
(UNESCO. The race against time for smarter development. UNESCO science report; Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future.)
There has also been a growing recognition that expectations regarding the capabilities of AI in medicine, often extrapolated from other areas of activity, may have been somewhat overstated relative to its current state of development (Miller DD. Machine intelligence for management of acute coronary syndromes: neural or nervous times?; Modine T, Overtchouk P. Machine learning is no magic: a plea for critical appraisal during periods of hype), and that the full real-world impact on patient outcomes has yet to be realised (Keane PA, Topol EJ. With an eye to AI and autonomous diagnosis).
To prevent a period of disillusionment with AI in medicine caused by inflated expectations, medical practitioners need a basic understanding of AI systems so that they can properly assess the technology's capabilities and hold realistic expectations. Therefore, this review will cover the following topics: 1) an introduction to AI theory to familiarise the clinician with some of the central concepts of this technology; 2) an overview of the process of building and evaluating AI systems, focusing on model training, validation, and quantifying model performance; 3) a discussion of the track record, barriers, and opportunities for implementation of AI models in clinical cardiology, with a review of implementation successes; and 4) a perspective on what can be done in the future to improve the rate of successful deployment and help AI fulfill its promise for medicine and cardiology. Given our focus on clinical implementation in cardiology, we will not address the current state of development of AI-based technology in cardiology subspecialties, but rather direct the reader to general reviews (Quer G, Arnaout R, Henne M, Arnaout R. Machine learning and the future of cardiovascular care: JACC state-of-the-art review; Johnson KW, Torres Soto J, Glicksberg BS, et al. Artificial intelligence in cardiology; Shameer K, Johnson KW, Glicksberg BS, Dudley JT, Sengupta PP. Machine learning in cardiovascular medicine: are we there yet?) and area-specific reviews: electrophysiology (Feeny AK, Chung MK, Madabhushi A, et al. Artificial intelligence and machine learning in arrhythmias and cardiac electrophysiology), cardiac imaging (Al'Aref SJ, Anchouche K, Singh G, et al. Clinical applications of machine learning in cardiovascular disease and its relevance to cardiac imaging; Dey D, Slomka PJ, Leeson P, et al. Artificial intelligence in cardiovascular imaging: JACC state-of-the-art review; Litjens G, Ciompi F, Wolterink JM, et al. State-of-the-art deep learning in cardiovascular image analysis; Krittanawong C, Johnson KW, Rosenson RS, et al. Deep learning for cardiovascular medicine: a practical primer), echocardiography (Kusunose K, Haga A, Abe T, Sata M. Utilization of artificial intelligence in echocardiography), interventional cardiology (Sardar P, Abbott JD, Kundu A, et al. Impact of artificial intelligence on interventional cardiology: from decision-making aid to advanced interventional procedure assistance), management of atrial fibrillation (Siontis KC, Yao X, Pirruccello JP, Philippakis AA, Noseworthy PA. How will machine learning inform the clinical care of atrial fibrillation?), and heart failure (Shin S, Austin PC, Ross HJ, et al. Machine learning vs conventional statistical models for predicting heart failure readmission and mortality).

      Introduction to AI Theory

AI is broadly defined as the development of computer systems that can display machine intelligence, that is, that are able to extract and use information from their environment to achieve a specific goal in a process that mimics human intelligence but without human intervention. Although the concept of AI has been around ever since Alan Turing designed the precursors of modern computers in the 1950s, real-life applications were limited until the long-awaited “thinking machines” envisioned by Turing became technically feasible (Turing A. Computing machinery and intelligence). The exponential growth in the volume of available data, advances in computational power and cloud computing, and the advent of machine learning (ML) have all been instrumental in the explosion of AI use in many fields, including medicine.
Early efforts in the field of AI focused on so-called strong AI, the generation of systems that use high-level processes to mimic human reasoning and apply those processes to new problems. This approach was expected to eventually lead to general AI, through which an AI system would be able to accomplish any task within the realm of human capability without needing task-specific training. These early efforts were very successful in some respects, particularly through the development of expert systems for situations that are rule dependent (eg, playing checkers; Samuel AL. Some studies in machine learning using the game of checkers), but they tended to perform poorly on simpler tasks that are intuitive to humans yet not necessarily rule based, and the approach also proved to have an upper limit in the complexity of the problems it could handle. Because of these limitations, the field of AI moved from general AI to narrower task-based AI (sometimes referred to as weak AI). Task-based AI systems are at the center of the current AI revolution. Roughly speaking, tasks performed by AI systems can relate to perception (computer vision, natural language processing), information processing (reasoning/problem solving, knowledge representation, learning), and performative tasks (task planning/optimisation, data generation including language and images, object manipulation and motion). By far the majority of AI systems being developed in medicine now specifically target learning tasks, but nonmedical applications routinely use the full constellation of AI tasks. AI systems can combine these tasks to increase the complexity of the situation or problem they can address (Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach). Although there is a growing consensus that task-based AI is unlikely to lead to general AI, which is now a separate area of research, the capabilities of current task-based AI systems have far exceeded those of systems built using other approaches (Fjelland R. Why general artificial intelligence will not be realized).
In addition to the tasks they perform, AI systems can also be classified based on their intended use, which has important implications for how those systems are built, evaluated, monitored, and regulated. AI systems can be designed to be automatic, assisting, augmenting, or autonomous (Fig. 1). Automatic systems, designed to handle simple repetitive tasks, and assisting systems, designed to provide the user with more information (eg, diagnoses, features, etc) for use in decision making, are lower-complexity systems. Augmenting systems, on the other hand, help the user to improve overall performance, either by providing insights that would not otherwise be available, such as patient-level predictions, or by performing part of a task directly (eg, robotic assistance). Finally, autonomous systems are those that perform complex tasks without human intervention. There are 2 major concepts underlying the intended-use framework: first, the degree of complexity and persistence (a function of how the tasks being performed change over time); and second, the degree of human involvement in accomplishing the tasks. Automatic and assisting systems generally have low complexity and low requirements for adaptability over time (ie, the extent to which the tasks or the underlying environment changes over time), whereas augmenting and autonomous systems handle tasks with a high degree of complexity and substantial adaptability requirements. Assisting and augmenting AI systems still require a large amount of human involvement and oversight, whereas automatic and autonomous systems work independently.
      Unsurprisingly, automatic and some assisting systems currently have the highest rate of successful implementations, and augmenting applications are still in development. Autonomous AI is not yet a reality in medicine, but autonomous systems do exist in full deployment in many other industries.
Figure 1. Classification of AI systems by purpose, complexity, and level of independence. Current and future examples of applications in cardiovascular medicine are indicated in each of the quadrants, with green indicating US Food and Drug Administration–approved applications, orange indicating applications currently being researched (at various levels), and red indicating use cases. AI, artificial intelligence; ECG, electrocardiography; EHR, electronic health record; NLP, natural language processing.

      Basics of Machine Learning

With the emerging role of AI in daily medical practice, it is important for clinicians to understand some basic terminology related to AI in this context (Fig. 2). As previously mentioned, in the medical realm the overwhelming majority of applications focus on the learning task and use ML as the primary methodology. Machine learning encompasses a group of algorithms that learn to accomplish a task without being explicitly programmed to do so. These algorithms use sample data to learn patterns that are then used for prediction or classification. Within the realm of ML, there are multiple learning approaches (supervised and unsupervised learning being the most common), chosen based on the purpose and the type of training data available, and each approach has a number of algorithms available. Deep learning, which uses neural networks, is one such algorithm among many other options, not all of which are based on neural networks. These terms are distinct from data science, data mining, and big data. Data science is the discipline concerned with applied ML (in contrast to computer science, which is concerned with the development of ML algorithms), and data mining is a subdiscipline of data science concerned with exploratory data analysis (as opposed to predictive analytics in the context of ML or traditional statistical testing). On the other hand, big data refers to the field of technologies and methods used to handle data sets larger than a typical computational platform can handle. In some areas, particularly in mainstream contexts, big data is used as an umbrella term incorporating data mining, data science, and the wide implementation of ML-based algorithms, but this formulation should be avoided in academic and technical settings, in the same way that the terms defined in this section should not be used interchangeably or as synonyms.
Figure 2. The landscape of health data and analytics. CNN, convolutional neural network; RNN, recurrent neural network.
As noted above, the 2 most common approaches to ML are supervised and unsupervised learning, which have different objectives (Fig. 3). Supervised learning is primarily used for classification and prediction problems. Supervised learning methods require a labelled data set (also known as “ground truth”), where the label is typically a diagnosis, membership in a group or class, or an outcome. The algorithm is trained, using example data, to recognise patterns of features (ie, known facts about the patient) associated with a specific label and thereafter to predict those labels in new cases. There are multiple categories of supervised learning algorithms, the most common of which include regression models, decision trees, support vector machines, and neural networks (the underlying algorithm for deep learning). Each of these families of algorithms includes multiple permutations/modifications applicable to specific situations. A brief introduction to deep learning is warranted here, given the popularity of these methods. Deep learning uses a layered structure of algorithms called an artificial neural network (ANN). Mimicking the activity of biological networks between neurons in the human neocortex, these models enable a process of learning that can reach far beyond the capabilities of other ML models. An ANN consists of many simple connected neurons, each of which produces a sequence of activations. Input neurons receive the initial input to the model, whereas subsequent neurons become activated through weighted connections from previously active neurons. As the input travels through the network, it is modified by these weights to eventually generate the output. The learning element in an ANN thus involves the search for the weights that make the model exhibit the desired behaviour.
Subtypes of ANN include convolutional neural networks, such as those used for object recognition from imaging data, and recurrent neural networks (RNNs), which are applied in natural language processing (NLP).
(Krittanawong C, Johnson KW, Rosenson RS, et al. Deep learning for cardiovascular medicine: a practical primer.)
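The forward pass just described, weighted sums passed through activation functions layer by layer, can be sketched in a few lines of Python. This is a toy illustration with hand-picked weights; in a real model the weights are learned from training data (eg, by backpropagation), and the two input "features" here are purely hypothetical:

```python
import math

def sigmoid(x):
    # Squashing activation: maps any weighted sum to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One forward pass through a tiny 2-layer network.

    hidden_weights: one weight vector per hidden neuron
    output_weights: weight vector for the single output neuron
    """
    # Each hidden neuron computes a weighted sum of the inputs,
    # then applies the activation function.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # The output neuron does the same over the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Illustrative, hand-picked weights (learned in practice)
hw = [[0.5, -0.6], [-0.3, 0.8]]
ow = [1.2, -0.7]

# Two hypothetical input features for one patient; the result is a
# probability-like score between 0 and 1.
risk = forward([1.0, 0.0], hw, ow)
```

The "learning" the text describes is the search for values of `hw` and `ow` that make outputs like `risk` match the labels in the training data.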
Figure 3. Methods of learning for machine learning systems: (A) supervised learning, (B) unsupervised learning, (C) reinforcement learning, and (D) semisupervised learning.
Unsupervised learning, in contrast, works with unlabelled data sets. This type of model attempts to detect the natural patterns present within a set of scattered data points without dependent variables. Unsupervised learning can be used to identify groups of cases sharing similarities (eg, clustering methods such as k-means and hierarchic clustering) as well as for dimension reduction (eg, principal component analysis). Unsupervised learning methods are often used to generate features that are then included in supervised learning algorithms (eg, cluster membership or principal component loads), but they can also be used for fundamental discovery. Cluster membership can reveal novel patterns or subphenotypes that were not previously clinically recognised and change our understanding of the spectrum associated with a clinical entity. For example, the application of unsupervised clustering to data sets from patients with heart failure with preserved ejection fraction has led to the discovery of 3 distinct phenogroups, each with a different pathophysiology and likely to respond to different therapeutic regimens (Katz DH, Deo RC, Aguilar FG, et al. Phenomapping for the identification of hypertensive patients with the myocardial substrate for heart failure with preserved ejection fraction; Horiuchi Y, Tanimoto S, Latif A, et al. Identifying novel phenotypes of acute heart failure using cluster analysis of clinical variables).
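To make the clustering idea concrete, the k-means procedure can be sketched in pure Python for a single variable. The values below are invented for illustration (real phenomapping operates on many variables simultaneously and uses established library implementations):

```python
def kmeans_1d(values, k, iters=20):
    """Minimal 1-dimensional k-means: alternate between assigning each
    point to its nearest centroid and moving centroids to cluster means."""
    # Initialise centroids spread evenly across the observed range
    centroids = [min(values) + (max(values) - min(values)) * i / (k - 1)
                 for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabelled measurements with two apparent subgroups
ef_values = [25, 28, 30, 32, 55, 58, 60, 62]
centroids, clusters = kmeans_1d(ef_values, k=2)
```

With these toy values the algorithm converges to two groups (low vs high), without ever being told the labels; this is exactly the "natural pattern" detection described above.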
A review by Johnson et al. provides a good introduction, intended for clinicians, to each of the main families of algorithms used in ML (Johnson KW, Torres Soto J, Glicksberg BS, et al. Artificial intelligence in cardiology), and a review by Shameer et al. provides comprehensive examples of cardiology applications of many types of ML methods (Shameer K, Johnson KW, Glicksberg BS, Dudley JT, Sengupta PP. Machine learning in cardiovascular medicine: are we there yet?).
There are other approaches to training ML algorithms, including semisupervised learning, in which algorithms are trained on a small amount of labelled data combined with a large amount of unlabelled data, and reinforcement learning, in which algorithms learn through rewards and penalties for correct and incorrect predictions. These approaches are still relatively new, and there are few examples of their application in a clinical context, although reinforcement learning has been used to optimise treatment of patients with sepsis in the intensive care unit (Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care).
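The reward/penalty mechanism behind reinforcement learning can be illustrated with a minimal tabular Q-learning update. This is a deliberately oversimplified sketch, not the method used in the sepsis study; the states, actions, and rewards are invented for illustration:

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: nudge the value of (state, action)
    toward the observed reward plus the discounted best future value."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

# Toy decision problem: two states, two candidate actions (purely illustrative)
q = {"unstable": {"dose_a": 0.0, "dose_b": 0.0},
     "stable":   {"dose_a": 0.0, "dose_b": 0.0}}

# Simulated experience: dose_a is rewarded (+1) because the hypothetical
# patient transitions to "stable"; dose_b is penalised (-1).
for _ in range(100):
    q_update(q, "unstable", "dose_a", reward=1.0, next_state="stable")
    q_update(q, "unstable", "dose_b", reward=-1.0, next_state="unstable")

# After repeated experience, the learned value of dose_a exceeds dose_b,
# so a greedy policy would select dose_a in the "unstable" state.
```

The table `q` plays the role of the learned policy: actions that repeatedly earn rewards accumulate value, which is the "reward/penalty system" described above.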

      The Process of Building and Evaluating ML Systems for Clinical Use

A major requirement for ML systems to be sustainable and clinically useful is their continuous assessment and evaluation at all points in their life cycle. This evaluation goes beyond the assessment of model performance to include an evaluation of the need for an ML system to address a specific clinical question, an assessment of potential future implementation challenges, and continuous evaluation of model performance and changes in outcomes after deployment. At the outset, the question of whether ML is a good solution, or even the best solution, for a specific problem should always be asked. Although ML has been used to create highly accurate algorithms, there is now a recognition that in some circumstances traditional statistical methods perform just as well as ML-based algorithms (Christodoulou E, Ma J, Collins GS, et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models). These include situations where the size of the available training data is too small or where the feature space is limited (a small number of mostly binary and normally distributed continuous inputs). The use of ML should also be considered carefully in situations where a high degree of explainability and transparency is required (Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead), even though some novel ML approaches can now produce more transparent models (Lauritsen SM, Kristensen M, Olsen MV, et al. Explainable artificial intelligence model to predict acute critical illness from electronic health records). Model training should only start when the appropriateness of ML in a specific situation has been established.

      Model training

Model training first requires the selection of a broad analytic approach; the choice is often task dependent and based on the nature of the input data. In some cases, multiple types of algorithms are potential candidates and the choice is based on a comparison of model performance between the different options. Algorithm training often requires the selection and tuning of hyperparameters, ie, the options specific to each type of algorithm that will yield the best-performing model. Hyperparameters are often adjusted empirically, that is, by trial and error until the best combination is found. Although the details of algorithm training provided in scientific papers can seem complex to the uninitiated, clinicians should remember that no single type of algorithm works in all situations, and ML studies should report the considerations that led to the choice of a specific analytic approach. However, as part of model training, there are 2 additional considerations that are common to all studies and will be addressed below: establishing ground truth and the representativeness of the training sample.
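The trial-and-error tuning described above amounts to evaluating each candidate hyperparameter value and keeping the best performer, as in this sketch with a toy k-nearest-neighbour classifier (the one-feature data set and the candidate values of k are invented for illustration):

```python
def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 >= k else 0

# Hypothetical data: (feature value, binary label) pairs
train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
holdout = [(2.5, 0), (3.5, 0), (6.5, 1), (8.5, 1)]

# Empirical tuning: try each candidate value of the hyperparameter k
# and keep the one with the best hold-out accuracy.
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):
    acc = sum(knn_predict(train, x, k) == y for x, y in holdout) / len(holdout)
    if acc > best_acc:
        best_k, best_acc = k, acc
```

Grid searches in real studies loop over many hyperparameters at once (and use cross-validation rather than a single hold-out set), but the logic is the same: evaluate each combination, keep the best.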
With supervised learning still representing the majority of ML models in medicine, the establishment of labels, also known as ground truth, remains a major point of attention. Establishing ground truth is the process by which labels (the target of the prediction algorithm) are assigned to training cases. This is a critical component of algorithm training, given that mislabelled training cases result in a poorly performing algorithm. The process of properly labelling training cases is resource intensive and often requires a substantial amount of work from content experts.
In some situations, high-quality training data are readily available; for example, ML algorithms in signal processing can rely on large public databases for training, which have become widely available given the large number of monitoring devices with digital storage and improved computing capacity. In fact, electrocardiographic (ECG) data were the first to be digitised and shared on platforms such as the PhysioNet database and the MIT-BIH arrhythmia database (Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals).
A similar wealth of information for model development can be found in imaging data that have been made publicly available through initiatives such as CAMUS (Leclerc S, Smistad E, Pedrosa J, et al. Deep learning for segmentation using an open large-scale dataset in 2D echocardiography) and STACOM. Imaging data, however, often require careful annotation, which is subject to intra- and interrater variability even when done by experts (Bunting KV, Steeds RP, Slater LT, et al. A practical guide to assess the reproducibility of echocardiographic measurements).
Clinical risk models present additional challenges. They often rely on sensitive patient information that cannot be completely anonymised and therefore cannot be released publicly, although some data generated in clinical trials or in large research initiatives such as the UK Biobank can be accessed by qualified investigators (Sudlow C, Gallacher J, Allen N, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age).
Therefore, public databases for these purposes are somewhat limited, and data collection remains costly and time intensive. This is particularly true for relatively rare conditions such as congenital heart disease, myocarditis, or cardiomyopathy, where the issue of limited data availability is further compounded and multicentre collaboration in large, and costly, registries is necessary. Further compounding the problem, many variables of interest to clinical cardiology are contained in administrative data or within patient health records. As a general rule, administrative data are not an optimal source of high-quality training data, especially when derived from claims data, which are generally not coded by content experts. Indeed, multiple studies have shown a substantial degree of disagreement between expert review of medical records and administrative data, in cardiology and beyond. Although the increased adoption of electronic health records over the past decade has resulted in more point-of-care clinical data being available (Furukawa MF, King J, Patel V, et al. Despite substantial progress in EHR adoption, health information exchange and patient engagement remain low in office settings), much of these data are unstructured free text and largely inaccessible. NLP has been investigated as one potential strategy to access these data for analysis (Jha AK. The promise of electronic records: around the corner or down the road?).
Wang et al. (NLP based congestive heart failure case finding: a prospective analysis on statewide electronic medical records) and Meystre et al. (Congestive heart failure information extraction framework for automated treatment performance measures assessment) have reported some initial examples applied to heart failure, in which NLP identified heart failure cases in uncoded data with a positive predictive value (PPV) of 0.914 and extracted heart failure medications with an overall precision of 0.955. This being said, even the most sophisticated NLP algorithms still face major technical challenges, including differences in terminology, local conventions and peculiarities, and homonymy. With the current state of the technology, the use of NLP to generate high-quality training data is unlikely to be a robust solution, given both the introduction of mislabelled training cases and the fact that NLP models themselves rely on large data sets of ground truth, which also require chart verification and annotation. Furthermore, NLP models are highly prone to failure when applied outside the training environment, because they are highly context specific and have limited ability to retrain themselves after changes in the deployment environment.
      The other important consideration in generating and assessing the quality training data set is its representativeness. As a general rule, the characteristics of the population in the training data set should be representative of the population in which the model will eventually be deployed. One should also be aware that training data can reflect historical racial, societal, and economic disparities that could be perpetuated or even exacerbated by implementing the model,
      • Parikh R.B.
      • Teeple S.
      • Navathe A.S.
      Addressing bias in artificial intelligence in health care.
      ,
      • Wolff R.F.
      • Moons K.G.M.
      • Riley R.D.
      • et al.
      PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.
      or institution-specific bias where local peculiarities are learned as rules by the algorithm.
      • Zech J.R.
      • Badgeley M.A.
      • Liu M.
      • et al.
      Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study.
Therefore, when critically appraising prediction models, particularly those created with the use of supervised learning, it is important to consider the quality of the labels, how they were obtained, the likelihood of their being misclassified, and the likelihood of ascertainment bias. As for representativeness, one should consider the source and characteristics of the patients included in the training data set. For clinicians, it is best to think of the quality of the labelling and the representativeness of the training data set with the same care and importance as, respectively, the primary outcome and study population of a clinical trial.

      Validation strategies

      At the outset of ML model development, the original data set is usually partitioned randomly into 2 parts: one for training and another for testing (or validation/hold-out). The training data set is the part of the data set from which the ML algorithm learns the relationship between features and the target variable (label). The testing data set is the part of the input data to which the algorithm, after it has been taught based on the training data set, is applied to validate whether it accurately performs the task. This validation strategy is feasible and favoured for very large data sets. In smaller data sets, however, using this method is not always possible. In those cases, another validation strategy known as k-fold cross-validation is used. It involves the data set being randomly split up into k groups, where each group is used, in turn, as the validation cohort for a model trained on all of the other groups (k − 1 groups). The process of splitting, model development, and testing is repeated k times such that every group has been the testing data set once. It is worth noting that k-fold cross-validation can also be used during model training, in which case the model is trained k times using cross-validation, with the learners merged at the end of the training process and then further evaluated on the hold-out data set. This method has been shown to improve model generalisability.
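The two partitioning strategies described above can be sketched in a few lines of Python. The use of scikit-learn and a synthetic data set here is purely illustrative (the article does not prescribe any library); a real clinical feature matrix and labels would take the place of the generated data.

```python
# Minimal sketch of hold-out and k-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data standing in for clinical features (X) and labels (y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold-out validation: one random split into training and testing data sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
holdout_acc = model.score(X_test, y_test)  # accuracy on the held-out 20%

# k-fold cross-validation (k = 5): every observation serves once as test data,
# with the model retrained k times on the remaining k - 1 folds.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

In small samples, the 5-fold loop above reuses every observation for both training and testing, which is why cross-validation is favoured when a single hold-out split would leave too little data for training.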
      While both of these approaches validate whether the ML model accurately performs the task that it has been trained for, they still report on the performance of the model in the same population it was derived from, which would thus be expected to have characteristics very similar to the training data set. This process is therefore referred to as internal validation. In contrast, external validation aims to assess generalisability of the ML model when applied to a different population. As a general rule, all prediction models should be evaluated through cross-validation and performance metrics should be reported for the validation sets. The hold-out method is not always possible in smaller samples, because it reduces the amount of data available for training, and it has been argued that unless the available sample size is extremely large, cross-validation should be preferred to hold-out validation given that they essentially provide the same information.
      • Moons K.G.
      • Altman D.G.
      • Reitsma J.B.
      • et al.
      Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration.
      It has been a common practice for hold-out validation to be used instead of external validation; however, this practice is not valid
      • Steyerberg E.W.
      • Bleeker S.E.
      • Moll H.A.
      • Grobbee D.E.
      • Moons K.G.
      Internal and external validation of predictive models: a simulation study of bias and precision in small samples.
      ; external validation requires the use of a fully independent data set.
      • Moons K.G.
      • Kengne A.P.
      • Grobbee D.E.
      • et al.
      Risk prediction models: II. External validation, model updating, and impact assessment.
      Given that internal validation is nearly universal as part of model development, internal performance metrics are generally included in scientific publications.
      • Moons K.G.
      • Altman D.G.
      • Reitsma J.B.
      • et al.
      Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration.
On the other hand, reporting of external validation has historically been limited in scientific publications, although it is now a requirement for a growing number of scientific journals. A recent review by Wessler et al.
      • Wessler B.S.
      • Nelson J.
      • Park J.G.
      • et al.
      External validations of cardiovascular clinical prediction models: a large-scale review of the literature.
      revealed that only 58% of cardiovascular clinical prediction models in the Tufts Predictive Analytics and Comparative Effectiveness Clinical Prediction Model Registry had undergone external validation, leaving much room for improvement.
      External validation of prediction models is important because they tend to perform worse outside their training data set. The review by Wessler et al.
      • Wessler B.S.
      • Nelson J.
      • Park J.G.
      • et al.
      External validations of cardiovascular clinical prediction models: a large-scale review of the literature.
found that in 81% of external validations of cardiovascular clinical prediction models, the area under the receiver operating characteristic curve (AUC) showed discrimination below that reported in the training data set, with the AUC changing by a median of −11.1% (interquartile range −32.4% to +2.7%). Model performance often decreases in external validation cohorts for multiple reasons, including potential differences, even small ones, between the characteristics of the populations. Another culprit, particularly affecting prediction models based on electronic health records, is a difference in the quality of data curation between the highly curated training data and the generally messier real-world validation data. But the most important reason is often related to model overfitting. This happens when the model learns patterns that are specific to the training data but are irrelevant to the underlying relationship between features and the target (ie, the model has become adapted to noise rather than signal). Overfitted models are thus unable to generalise to new observations. If the model has a high accuracy in the training data set but performs poorly in the testing data set, this may indicate overfitting. Overfitting is particularly prevalent in studies with small sample sizes, rare outcomes, and reporting of performance only from the original training data set. One solution to prevent overfitting is using more training data, although this is not always practically feasible. A second solution is to use cross-validation, thus reducing the capacity of the model to memorise training data and increasing the chance of incorporation of signal into the model. Other options include limiting the number of features, early stopping (stopping the training process before overfitting occurs), regularisation (forcing the model to be simpler), and ensembling (combining predictions from multiple separate models).
Ensembling can be done by combining different ML methods, by training each learner on a bootstrap sample of the observations (bagging, ie, bootstrap aggregating), or by additionally training each learner on a random subset of the features (the random subspace method, as used in random forests). Thus, the critical appraisal of prediction models should always include considerations for both internal and external validation of model performance.
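Two of the overfitting countermeasures named above, regularisation and bagging-style ensembling, can be sketched as follows. The library (scikit-learn), data set, and hyperparameter values are illustrative assumptions, not part of the original text.

```python
# Illustrative sketch (scikit-learn assumed): regularisation and bagging
# as overfitting countermeasures, compared via cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=1)

# Regularisation: a smaller C imposes a stronger penalty, forcing a simpler
# logistic model that is less able to memorise noise in the training data.
regularised = LogisticRegression(C=0.1, max_iter=1000)

# Bagging: each shallow tree is trained on a bootstrap sample of the
# observations and a random half of the features; predictions are averaged.
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=50, max_features=0.5, random_state=1)

results = {}
for name, est in [("regularised LR", regularised), ("bagged trees", bagged)]:
    scores = cross_val_score(est, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: cross-validated AUC = {scores.mean():.3f}")
```

Reporting the cross-validated AUC, rather than the AUC on the training data itself, is what keeps a memorised model from appearing deceptively accurate.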

      Assessment of model performance

      Model performance can be defined by several measures, none of which capture all aspects of model performance, nor should they be interpreted in isolation.
      • Alba A.C.
      • Agoritsas T.
      • Walsh M.
      • et al.
      Discrimination and calibration of clinical prediction models: users’ guides to the medical literature.
The first aspect of model performance is discrimination, the ability of a model to accomplish the task it is designed to do (eg, accurately separating patients who experienced an event and those who did not). Good discrimination is achieved when the predicted probabilities of the different groups have very little overlap with each other, thus accurately delineating membership in each group. The most common measure of discriminative ability is the AUC, also known as the C-statistic or C-index (for concordance between model-based estimates and observed events). The AUC represents the probability that, given one individual with the event and another without, the model will yield a higher predicted probability for the first than for the second. This value generally ranges from 0.5 (random concordance) to 1 (perfect concordance). It is important to understand that AUC provides a more balanced estimate of the discriminative ability of a model than accuracy (the sum of true positives and true negatives divided by the total number of patients), which is dominated by the majority class: at low prevalence, a model can achieve high accuracy by simply never predicting the outcome, whereas at high prevalence the outcome is weighted too heavily. To illustrate this, a prediction model for risk of endocarditis that predicted “no risk of infective endocarditis” for everyone regardless of their characteristics would have an accuracy of > 99.9% (because the incidence is only 11-15/100,000 population), whereas the AUC would be 0.500.
      • Pant S.
      • Patel N.J.
      • Deshmukh A.
      • et al.
      Trends in infective endocarditis incidence, microbiology, and valve replacement in the United States from 2000 to 2011.
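The endocarditis illustration above can be reproduced numerically in a few lines. This is a sketch only: the cohort is constructed to match the quoted incidence (13 cases per 100,000 is taken from the middle of the 11-15 range), and scikit-learn is an illustrative library choice.

```python
# Sketch: a "no risk for everyone" model is near-perfectly accurate at low
# prevalence, yet its AUC is 0.5 (no discriminative ability at all).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

n = 100_000
y_true = np.zeros(n, dtype=int)
y_true[:13] = 1                  # ~13 cases per 100,000 patients (assumed)
y_pred = np.zeros(n, dtype=int)  # "no risk of endocarditis" for every patient

acc = accuracy_score(y_true, y_pred)   # (100,000 - 13) / 100,000 = 0.99987
auc = roc_auc_score(y_true, y_pred)    # 0.5: constant scores cannot rank cases
print(f"accuracy = {acc:.5f}, AUC = {auc:.3f}")
```

This is why an impressive-sounding accuracy figure, reported alone, says little about whether a model can actually separate patients with and without the outcome.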
Although the AUC is more informative than raw accuracy, a drawback is that it equally values false positives and false negatives, which is antithetical to clinical medicine. Alternative metrics are the PPV (proportion of patients with predicted event who actually have an event) and the negative predictive value (NPV; proportion of patients for whom the model does not predict an event and who actually do not experience the event), both of which are measures of a model’s misclassification. Both measures can be useful in situations where the consequences of wrong (prediction-based) management can involve serious harm for the patient. One such example is the decision to proceed with cardiac surgery in high-risk patients. If the model predicts a higher risk of major adverse cardiac events with conservative management and the model has a high PPV, the surgeon may reasonably rely on the model’s prediction and advise surgical treatment; however, if the model had a low PPV, there would be a considerable chance that a patient who might have fared better with conservative management was wrongfully exposed to the risk of surgery (eg, if the PPV were 60%, 40% of the decisions for surgery would rest on false-positive predictions). Conversely, 1 − NPV gives a good indication of the patient’s chances of having an event even though the model predicted they would not. There are multiple additional metrics that assess various aspects of model performance, but at a minimum, all models should be reported with AUC and at least 1 additional set of metrics addressing the differential value of false positives and false negatives.
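The PPV and NPV definitions above follow directly from a confusion matrix. The cohort below is hypothetical, chosen so that the PPV matches the 60% surgical-decision scenario in the text; scikit-learn is again an illustrative assumption.

```python
# Sketch: PPV and NPV from a confusion matrix for a hypothetical cohort of
# 1000 patients, 100 of whom the model flags as high risk.
import numpy as np
from sklearn.metrics import confusion_matrix

# 100 predicted events (60 true events, 40 false alarms) and
# 900 predicted non-events (20 of whom nevertheless had an event).
y_pred = np.array([1] * 100 + [0] * 900)
y_true = np.array([1] * 60 + [0] * 40 + [1] * 20 + [0] * 880)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = tp / (tp + fp)   # 60 / 100  = 0.60
npv = tn / (tn + fn)   # 880 / 900 ≈ 0.978
missed = 1 - npv       # chance of an event despite a "no event" prediction
print(f"PPV = {ppv:.2f}, NPV = {npv:.3f}, 1 - NPV = {missed:.3f}")
```

With these numbers, 40% of surgical decisions driven by the model would rest on false-positive predictions, even though the model rarely misses true events among those it labels low risk.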
Beyond discrimination, the other component of model performance is calibration, a measure of how closely the predicted risks match the observed event rates in the targeted population. This measure is often presented in a calibration plot, which depicts the observed values (on the y-axis) vs the predicted values (on the x-axis). The 45-degree line in this plot denotes perfect calibration, meaning that the predicted value exactly matches the observed value in the data set. From the calibration plot, it is easy to assess the range of values for which the model has poor calibration (Fig. 4). Model fit, or internal consistency, can also be assessed with the use of goodness-of-fit tests, such as the Hosmer-Lemeshow test. However, while a significant P value (P ≤ 0.05) implies poor model fit, the opposite is not true: a model with P > 0.05 can still have poor calibration. Furthermore, a simple goodness-of-fit test does not provide the same detailed information as a calibration plot. For these reasons, the latter is usually preferred. Unfortunately, model calibration has not consistently been reported in the past, although it is now a requirement for a growing number of scientific journals.
      Figure 4Examples of calibration plots for prediction models: (A) example of good calibration, (B) example of imprecise prediction model, (C) a situation where a prediction model overestimates risk in low-risk patients, and (D) a situation where a prediction model underestimates risk in high-risk patients.
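The data behind a calibration plot of this kind can be produced by binning predicted probabilities and comparing them against observed event rates. The sketch below assumes scikit-learn and synthetic data; a well calibrated model yields bins that track the 45-degree line.

```python
# Sketch: observed event fraction (y-axis) vs mean predicted risk (x-axis)
# per probability bin, the raw material of a calibration plot.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # predicted event probabilities

# Bin predictions into 5 groups; compare mean prediction with observed rate.
obs, pred = calibration_curve(y_te, probs, n_bins=5)
for p, o in zip(pred, obs):
    print(f"predicted risk {p:.2f} -> observed event rate {o:.2f}")
```

Plotting `obs` against `pred` and overlaying the identity line reproduces panels like those in Figure 4; systematic deviation above or below the line corresponds to under- and overestimation of risk, respectively.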
      One important final note about the evaluation of prediction models is that model performance is a dynamic concept, not a static one. Because medicine is a continuously evolving field, prediction models become stale and training data sets obsolete. Therefore, performance should be consistently monitored and models should regularly be updated to reflect the current state of the art.
      • Krittanawong C.
      • Johnson K.W.
      • Rosenson R.S.
      • et al.
      Deep learning for cardiovascular medicine: a practical primer.

      What Is the Implementation Success Rate for ML-Based Technology in Cardiology?

After reviewing the complexities that go into the development of ML-based algorithms, we will now discuss issues surrounding implementation. One of the most common fates for clinical prediction models, whether ML based or not, is failure to move from development to actual use in clinical settings despite excellent performance in a sandbox environment. Although there are multiple reasons for this, some of which will be investigated in further detail here, one of the most important culprits is the limited attention paid during model development to the implementation challenges that can later derail deployment.
      To illustrate this point, we performed a review of the literature on ML and AI in cardiology from 1990 to 2020. The methodology that was used for that review included an NLP algorithm and has been described previously by Quer et al.
      • Quer G.
      • Arnaout R.
      • Henne M.
      • Arnaout R.
      Machine learning and the future of cardiovascular care: JACC state-of-the-art review.
      Briefly, bibliometric information was retrieved on all publications identified through a PubMed search using combinations of the search terms “clinical decision,” “decision support,” “algorithms” with either “card∗ abnormalities,” “card∗ disease∗,” “card∗ surg∗,” “heart disease,” “card∗,” and “cardiology.” The publications were reviewed and annotated according to whether they were original research articles, whether they used ML methods, and the disease and modalities they studied.
In the period 1990-2020, a total of 11,118 articles on algorithms in cardiology were published in PubMed, including 6276 articles using ML models and 4842 applying “traditional” regression models (Fig. 5A). Whereas the latter stabilised in the past decade, the number of publications using ML models continued to increase at a rate of ∼10%/year, representing tremendous growth. In 2020, 2 ML articles were published for each article using traditional regression models for prediction problems. The heatmaps in Figure 5B show the number of articles by disease and modality, revealing that computed tomography (CT), ECG, and echocardiography in the setting of atherosclerosis, as well as ECG in the setting of arrhythmias, were the most commonly studied topics. On the other hand, many other topics have received less attention and remain underrepresented in the literature. These trends were comparable between ML and traditional regression, although ML models were 3 times more common than traditional regression models in articles about ECG-based arrhythmia detection.
      Figure 5Trends in articles related to machine learning and artificial intelligence in cardiology during 1990-2020. (A) Number of articles per year, grouped into machine learning models and ‘traditional’ regression models. (B) Heatmap representing the number of publications by disease and modality. CT, computed tomography; ECG, electrocardiography; EHR, electronic health record; MRI, magnetic resonance imaging. (C) Number of publications with the keywords “implementation,” “evaluation,” “integration,” “application,” or “validation” (solid) compared with the total number of publications (shaded) per year.
Subsequently, we used specific key words to estimate the overall implementation rate of these articles. We found that 437 articles mentioned the key word “implementation,” 1292 mentioned “evaluation,” 139 mentioned “integration,” 1016 mentioned “application,” and 860 mentioned “validation”; in total, 2587 articles (23.3%) mentioned at least 1 of these key words, suggesting some consideration of implementation of their algorithm. Figure 5C reveals that this percentage remained stable over time, while the absolute number of articles reporting implementation increased.
Because the above analysis was primarily an automated informal survey of the literature intended to identify global patterns, we conducted a manual review of articles published in the year 2018 (2018 was selected because it allowed at least 2 years of post-publication time for implementation). We identified a total of 97 articles that reported ML-based clinical risk prediction models. Here, we assessed implementation rate with the use of stricter criteria: the model had to have received US Food and Drug Administration (FDA) clearance,
      • Muehlematter U.J.
      • Daniore P.
      • Vokinger K.N.
      Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015-20): a comparative analysis.
      or there had to be very clear evidence for implementation in real-life clinical practice available in the article itself and/or elsewhere on the internet. Using these criteria, only 3 articles (3.1%) were found to have had true implementation success. This number should be interpreted carefully, however, given that many ML-based systems for cardiovascular disease that were cleared by the FDA came from the industry (as opposed to academia), with little previous publication in the scientific literature.

      Why Are ML-Based Technologies Failing Implementation at Such a High Rate in Clinical Settings?

Implementation challenges with new technology and with predictive models in clinical medicine are not new. In a systematic review of clinical prediction models developed in the field of cardiovascular diseases from 1990 to 2012, Wessler et al. identified 796 models. The majority (667, 84%) of these models were based on logistic and Cox regression, and they showed a lack of standardisation, generally poor quality of reporting, limited penetration in clinical practice, and little impact on patient care.
      • Wessler B.S.
      • Lai Yh L.
      • Kramer W.
      • et al.
      Clinical prediction models for cardiovascular disease: Tufts Predictive Analytics and Comparative Effectiveness Clinical Prediction Model database.
      Some of those challenges have been partially addressed but they still exist and affect ML models in the same way that they affect clinical prediction models created with more traditional statistical methods. Other implementation challenges are either unique to ML models or at least are amplified because of some of the unique features of those algorithms.

      Mismatched expectations

The first major challenge to implementation of ML models is one of mismatched expectations and poor fit between the current capabilities of ML systems and their intended use. In the lead-up to the current hype around ML in medicine, there have been many claims about the capabilities of these systems that were either unrealistic or at least premature. However, it is important to realise that in its current form, ML is generally an extension of traditional modelling approaches, with their own limitations.
      • Beam A.L.
      • Kohane I.S.
      Big data and machine learning in health care.
      Many medical applications of ML systems have been envisioned beyond these limitations and as such invariably fall short of expectations.
Part of the hype around ML in medicine is the idea that ML models are something of a magic bullet that can solve a variety of problems that have affected the development and implementation of clinical prediction models. The phrase “using an ML approach,” invoked as though it will lead to some new transformational result, clearly highlights the high degree of unnecessary use of ML. Unfortunately, ML models remain affected by sample size, poor quality and inherent bias in the training data, poor study design, and mislabelling of outcomes. Indeed, a recent meta-analysis showed no better performance for clinical prediction models created with ML than with logistic regression when based on traditional data elements extracted from the electronic health record.
      • Christodoulou E.
      • Ma J.
      • Collins G.S.
      • et al.
      A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.
      However, one area where ML-based models have shown an improvement over traditional methods is that of clinical prediction models that include features derived from diagnostic modalities (eg, imaging, ECG, physiologic data). In a comprehensive review of ML-based prediction models published through 2019, the average AUC for published models was > 0.90 in most clinical areas. In comparison, the AUC was ∼0.80 from the models included in Wessler et al.
      • Wessler B.S.
      • Lai Yh L.
      • Kramer W.
      • et al.
      Clinical prediction models for cardiovascular disease: Tufts Predictive Analytics and Comparative Effectiveness Clinical Prediction Model database.
      More than 80% of the clinical prediction models had been created with logistic or Cox regression, and those incorporating diagnostic-derived features performed better than those without those features.
      • Krittanawong C.
      • Virk H.U.H.
      • Bangalore S.
      • et al.
      Machine learning prediction in cardiovascular diseases: a meta-analysis.
The discrepancy between the intended use of ML in medicine and its actual capabilities is one of the major reasons for the high implementation failure rate of ML-based applications in medicine.

      Model performance

The second limitation is poor model performance outside of the training setting. Clinical prediction models tend to be developed in a sandbox environment with curated data and few, if any, implementation challenges. Once deployed in the real world, these ML models tend to perform much more poorly than their published performance would suggest. There are many reasons for this decrease in performance,
      • Krittanawong C.
      • Johnson K.W.
      • Rosenson R.S.
      • et al.
      Deep learning for cardiovascular medicine: a practical primer.
      most notably: 1) reliance on medical records for model training, which tend to be incomplete and prone to erroneous data, reflect current medical practice whether or not it is optimal, and have depreciating representativeness over time; 2) lack of external replication studies before deployment; 3) bias in model training reflecting historical racial, societal, and economic disparities
      • Parikh R.B.
      • Teeple S.
      • Navathe A.S.
      Addressing bias in artificial intelligence in health care.
      ,
      • Wolff R.F.
      • Moons K.G.M.
      • Riley R.D.
      • et al.
      PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.
      ; 4) institution-specific bias, where local peculiarities are learned as rules by the algorithm, which is particularly prominent in single-centre imaging studies
      • Zech J.R.
      • Badgeley M.A.
      • Liu M.
      • et al.
      Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study.
      ; and 5) model overfitting and lack of generalisability when the characteristics of the population in which a model is deployed do not match the characteristics of the population in which the model was trained. Poor model performance is one of the major factors explaining the reluctance of some physicians to use clinical prediction models for decision support.
      • Gottdiener J.S.
      • Fohner A.E.
      Risk prediction in heart failure: new methods, old problems.

      Interpretability of prediction

The next reason is the lack of interpretability of the predictions generated by many ML methods. For many ML methods, including those based on neural networks, there is no direct way to trace how the model generates a prediction from the inputted features; these are colloquially referred to as “black box” models. Whereas interpretability is less important for task-focused algorithms, whose performance against a gold standard is easily measured and monitored, it is of utmost importance for clinical predictions that health care providers rely on as part of augmenting algorithms. Poor interpretability is the most often cited reason for some physicians’ lack of trust in ML-based prediction models.
      • Rudin C.
      Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
      ,
      • Castelvecchi D.
      Can we open the black box of AI?.

      Deployment infrastructure

The final set of reasons for the poor track record of deployment of ML models in cardiology is also the least discussed: the lack of infrastructure and framework for deployment and post-deployment model management. Deployment infrastructure is an important consideration. ML models often require a large number of features from various data sources, which requires an integration point for raw data, a data-processing script that often needs to be locally adapted, and a high-performance computer engine, none of which are generally available in standard electronic health record platforms. In addition to technological impediments to deployment, there are also organisational ones that contribute to the poor track record. The regulatory framework around clinical prediction models is in its relative infancy, with many questions regarding ethics, liability, security, and reimbursement still unresolved. The lack of clear guidance on the regulatory framework around those algorithms makes planning for future deployment and implementation much more difficult. Guidance on the regulatory status of software as a medical device
      US Food and Drug Administration
      Artificial intelligence/machine learning (AI/ML)–based software as a medical device (SaMD) action plan. January 2021.
      and research into the regulatory evaluation and approval process are emerging.
      • Wu E.
      • Wu K.
      • Daneshjou R.
      • et al.
      How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals.
      ,
      • Benjamens S.
      • Dhunnoo P.
      • Mesko B.
      The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database.
There is still a lack of research on how clinicians use and understand clinical predictions, on human factors considerations, and on how augmentation algorithms affect the organisation of health care delivery and clinicians’ cognitive processes. Furthermore, there is no mechanism to manage the unintended consequences of large-scale algorithm deployment. For example, the deployment of smart devices that detect atrial fibrillation led to concerns about overdiagnosis, unnecessary testing secondary to false-positive readings, and overdetection of nonpathologic forms of disease that would otherwise not have caused complications or required treatment.
      • Brownlee S.
      • Chalkidou K.
      • Doust J.
      • et al.
      Evidence for overuse of medical services around the world.
      • Turakhia M.P.
      • Shafrin J.
      • Bognar K.
      • et al.
      Economic burden of undiagnosed nonvalvular atrial fibrillation in the United States.
      • Moss T.J.
      • Calland J.F.
      • Enfield K.B.
      • et al.
      New-onset atrial fibrillation in the critically ill.
      Finally, there is no consensus on the long-term management of algorithms, including the continuous monitoring of an algorithm’s performance and patient outcomes. As evidenced above, there are multiple barriers to the implementation of ML-based models in clinical settings, some preexisting and some specific to ML. These barriers are a substantial disincentive to algorithm implementation and together explain in large part the high failure rate for the clinical deployment of such models.

      What Do Successful ML Implementations Have in Common?

      Despite those barriers, ML already has had an undeniable impact in clinical cardiology, and many algorithms have been successfully deployed and/or are in the latter stages of deployment. Of the 343 ML-based applications cleared by the FDA as of June 2021, 41 were in the cardiovascular realm and 241 were in radiology (some with use in cardiac imaging). A collection of representative examples of FDA-cleared ML-based applications and other applications that are being used in clinical practice are presented in Table 1.
      These applications can be divided into 1) signal processing and diagnostics, 2) image processing and diagnostics, and 3) clinical risk stratification and information retrieval from electronic health records through NLP. The area of cardiology with by far the highest number of successfully implemented ML-based algorithms is signal processing and diagnostics, with many applications in use to detect various arrhythmias (and, more recently, heart failure; Attia et al. 2019) through consumer and medical-grade wearable devices or embedded in Holter monitors (Tison et al. 2018; Wasserlauf et al. 2019; Bumgarner et al.; Comstock 2017; Muoio 2019; Goldenthal et al.; Nguyen et al. 2018; Hannun et al.; McManus et al.).
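Many of the wearable AF detectors cited above start, at least conceptually, from the irregularity of beat-to-beat (RR or pulse-to-pulse) intervals. The following toy sketch is not any vendor's or study's algorithm; it merely illustrates the kind of RR-interval irregularity features such systems learn or threshold, with an illustrative 0.10 cutoff.

```python
# Toy illustration only -- NOT any vendor's or study's algorithm. Wearable
# AF detectors (rule based or ML based) commonly build on beat-to-beat
# (RR) interval irregularity features like the ones computed here; the
# 0.10 threshold is an illustrative assumption.
import statistics

def rr_features(rr_ms):
    """Return (cv, rmssd) for a list of RR intervals in milliseconds."""
    mean_rr = sum(rr_ms) / len(rr_ms)
    cv = statistics.pstdev(rr_ms) / mean_rr            # coefficient of variation
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]  # successive differences
    rmssd = (sum(d * d for d in diffs) / len(diffs)) ** 0.5
    return cv, rmssd

def flag_irregular(rr_ms, cv_threshold=0.10):
    """Flag a rhythm as irregular when beat-to-beat variation is high."""
    cv, _ = rr_features(rr_ms)
    return cv > cv_threshold

regular = [800, 805, 798, 802, 801, 799, 803, 800]   # steady, sinus-like RR
chaotic = [620, 910, 540, 1050, 700, 480, 990, 610]  # highly irregular, AF-like RR
print(flag_irregular(regular), flag_irregular(chaotic))  # False True
```

Production systems replace the fixed threshold with a trained classifier (eg, the deep neural networks of Tison et al. and Hannun et al.), but the underlying signal being exploited is the same.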
      ML-based systems have also been used successfully for image processing and diagnostics with the 3 main imaging modalities in cardiology: echocardiography, CT, and magnetic resonance imaging. Successfully implemented applications are used for automated view identification, segmentation of cardiac chambers, quantification of cardiac structures and function, and disease detection (Knackstedt et al. 2015; Zhang et al. 2018; Retson et al. 2020; Arterys Cardio DL, FDA clearance 2017; Itu et al. 2016).
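The quantification step that follows automated segmentation is often plain arithmetic. As a minimal sketch (the function name is illustrative, not from any cited product), once a model has delineated the left ventricle at end-diastole and end-systole, the ejection fraction follows directly from the two volumes:

```python
# Minimal sketch: after an ML pipeline segments the left ventricle at
# end-diastole and end-systole, the quantification step is simple
# arithmetic on the two volumes; the function name is illustrative.

def ejection_fraction(edv_ml, esv_ml):
    """EF (%) = 100 * (EDV - ESV) / EDV for volumes in millilitres."""
    if edv_ml <= 0 or esv_ml < 0 or esv_ml > edv_ml:
        raise ValueError("volumes must satisfy 0 <= ESV <= EDV with EDV > 0")
    return 100.0 * (edv_ml - esv_ml) / edv_ml

print(round(ejection_fraction(120.0, 50.0), 1))  # 58.3
```

The hard part, and the part the deep-learning systems above address, is producing accurate volumes in the first place; the downstream formula is deliberately trivial.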
      On the other hand, implementation successes for clinical prediction models based on ML algorithms are much rarer in cardiology. For example, Ye et al. (2018) successfully applied a supervised ML algorithm, XGBoost, to develop and validate a risk-prediction model for incident essential hypertension within the following year. In addition, ML-based applications for heart failure case finding and automated extraction of medication-related information have been reported as deployed into clinical practice (Wang et al. 2015; Meystre et al. 2017).
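The discrimination of risk models such as these is typically summarised by the C-statistic: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A minimal pure-Python sketch of that computation (illustrative only, not the published pipeline):

```python
# Minimal pure-Python sketch (illustrative, not any published pipeline):
# the C-statistic is the probability that a randomly chosen case gets a
# higher predicted risk than a randomly chosen control, ties counting 1/2.

def c_statistic(scores, labels):
    """Concordance statistic (AUC) for predicted risks and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # predicted risks
labels = [1, 1, 0, 1, 0, 0]              # observed outcomes
print(c_statistic(scores, labels))  # 8/9, about 0.889
```

In practice a library routine (eg, scikit-learn's `roc_auc_score`) would be used on cohorts of hundreds of thousands of records, but the quantity being reported is the same.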
      Table 1. Selected examples of deployed ML applications in cardiology
      Study | Primary aim | Type of algorithm | Data sample | Performance | Stage of development
      Signal processing
       Tison et al. (2018) | Automated detection of AF based on PPG and accelerometer recordings of smartwatches | Deep neural network with heuristic pretraining | Heart rate and step count data obtained using the Cardiogram mobile application on Apple Watches from 9759 Health eHeart Study participants | Sensitivity 98.0%, specificity 90.2%, C-statistic 0.97 | Findings were externally validated as part of the same study
       Wasserlauf et al. (2019) | Automated detection of AF based on Apple Watch Series 2 or 3 with KardiaBand (AliveCor) | SmartRhythm 2.0, a convolutional neural network | Heart rate, activity level, and ECG data from 7500 AliveCor users (training), and data from 24 patients with an insertable cardiac monitor and a history of paroxysmal AF (validation) | Episode sensitivity 97.5%, duration sensitivity 97.7%, patient sensitivity 83.3% overall and 100% during time worn | Improvement beyond SmartRhythm 1.0, which was validated in 100 patients undergoing cardioversion (Bumgarner et al.; sensitivity 93%, specificity 84%, with 33.7% of recordings noninterpretable); FDA clearance in 2017 (Comstock 2017); sales ended in 2019 as native ECG functionality came to the Apple Watch Series 4 (Muoio 2019); AliveCor continues to invest in its KardiaMobile technology (Goldenthal et al.)
       Nguyen et al. (2018) | Detection of ventricular fibrillation and ventricular tachycardia (shockable rhythms) to improve shock advice algorithms in automated external defibrillators | Convolutional neural network as a feature extractor and boosting classifier | 1135 shockable segments and 5185 nonshockable segments from 57 records in public databases (Creighton University Ventricular Tachyarrhythmia Database and the MIT-BIH Malignant Ventricular Arrhythmia Database) | Accuracy 99.3%, sensitivity 97.1%, specificity 99.4% (compared with sensitivity 90% and specificity 95% in current American Heart Association algorithms) | Similar results have been reported in models developed by Picon et al. and Acharya et al.
       Attia et al. (2019) | Identification of asymptomatic LV dysfunction based on ECG | Convolutional neural network using the Keras framework with a TensorFlow (Google) backend and Python | ECG-TTE pairs: 35,970 (training), 8989 (internal validation), 52,870 (testing) | Accuracy 85.7%, sensitivity 86.3%, specificity 85.7%, C-statistic 0.93 | To be validated in the EAGLE trial (Yao et al.)
       Yang et al. (2018) | Binary classification of cardiovascular abnormality using time-frequency features of cardiomechanical signals, namely SCG and GCG signals | Decision tree and SVM methods with features generated by continuous wavelet transform | Experimental measurements from 12 patients with cardiovascular diseases (age 68 ± 18 years) and 12 healthy subjects (age 25 ± 2 years) | Accuracy > 94%, with the best performance from SVM applied to GCG features (99.5%) | Analysis of cardiomechanical signals has been adopted by other studies (Siecinski et al.; Yang et al.), and SCG and GCG signal data have become available in the PhysioNet Cardiovascular Toolbox
      Image processing
       Knackstedt et al. (2015) | Rapid and reproducible measurement of LV volumes, EF, and average biplane LS on echocardiography | — | 4- and 2-chamber echocardiographic views from 255 patients in sinus rhythm | Feasibility 98%; good agreement with reference values for automated EF and LS, with no variability | TOMTEC AutoLV
       Zhang et al. (2018) | Automated echocardiogram interpretation, including view identification, segmentation of cardiac chambers across 5 common views, quantification of structures and function, and disease detection | Convolutional neural networks | 14,035 echocardiograms spanning a 10-year period | Identification of views in > 95%; median absolute deviation 15%-17% for structure and < 10% for function; detection of hypertrophic cardiomyopathy, cardiac amyloidosis, and pulmonary disease with C-statistics of 0.93, 0.87, and 0.85, respectively | EchoCV, a web-based fully automated echocardiogram interpretation system, has been patented and made available online
       Retson et al. (2020) | Measurement of RV and LV volume and function on MRI images for a range of clinical indications and pathologies | Deep learning | 200 noncongenital clinical cardiac MRI examinations | Strong correlations for LV (> 0.94) and RV (> 0.92) volumes | Arterys Cardio DL received FDA clearance in 2017
       Itu et al. (2016) | Decreasing the computational demand of FFR calculation by developing an ML-based model as an alternative to computational fluid dynamics approaches | Deep neural network | 125 lesions in 87 patient-specific anatomic models generated from CT data using image segmentation | Excellent correlation (0.9994; P < 0.001) and no systematic bias on Bland-Altman analysis; FFR ≤ 0.80 was predicted with sensitivity 81.6%, specificity 83.9%, and accuracy 83.2% | Applied and tested in various settings (Eberhard et al.)
      Clinical risk stratification
       Ye et al. (2018) | Risk-prediction model for incident essential hypertension within the following year | Feature selection and generation of an ensemble of classification trees using XGBoost | Individual patient electronic health records from the Health Information Exchange data set of Maine (823,627 in the retrospective cohort and 680,810 in the prospective cohort) | C-statistics of 0.917 in the retrospective cohort and 0.870 in the prospective cohort | Deployed in the state of Maine, where similar risk models have been implemented for chronic kidney disease (Hao et al.)
       Meystre et al. (2017) | Development of CHIEF to automatically extract LV function mentions and values, congestive HF medications, and documented reasons for a patient not receiving these medications | Combination of rules, dictionaries, and ML methods | Various clinical notes from 1083 Veterans Health Administration patients | High recall and precision for HF medications and EF (> 0.960), but only fair recall and precision for reasons for not prescribing HF medications (< 0.400) | Additional improvements of the model have been reported (Garvin et al.; Kim et al.; Gobbel et al.) as part of the Automated Data Acquisition for Heart Failure project
       Wang et al. (2015) | Identification of HF cases from both structured and unstructured EMRs | Random forest models | 2,139,299 notes in the Maine Health Information Exchange EMR database from July 1, 2012, to June 30, 2014 | Positive predictive value of 91.4% | Integrated into the Maine Health Information Exchange live system
      AF, atrial fibrillation; CHIEF, Congestive Heart Failure Treatment Performance Measure Information Extraction Framework; CT, computed tomography; EAGLE, ECG AI-Guided Screening for Low Ejection Fraction; ECG, electrocardiography; EF, ejection fraction; EMR, electronic medical record; FDA, US Food and Drug Administration; FFR, fractional flow reserve; GCG, gyrocardiography; HF, heart failure; LS, longitudinal strain; LV, left ventricular; MIT-BIH, Massachusetts Institute of Technology-Beth Israel Hospital; ML, machine learning; MRI, magnetic resonance imaging; PPG, photoplethysmography; RV, right ventricular; SCG, seismocardiography; SVM, support vector machine; TTE, transthoracic echocardiography.
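The sensitivity and specificity figures quoted in the performance column of Table 1 reduce to simple counts over a labelled test set. As a minimal sketch (function and variable names are illustrative):

```python
# Minimal sketch: sensitivity and specificity, as reported for the
# applications in Table 1, computed from binary predictions vs truth
# (1 = disease present / predicted, 0 = absent / not predicted).

def sensitivity_specificity(predictions, labels):
    """Return (sensitivity, specificity) for binary predictions vs truth."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

preds = [1, 1, 0, 1, 0, 0, 1, 0]
truth = [1, 1, 1, 0, 0, 0, 1, 0]
print(sensitivity_specificity(preds, truth))  # (0.75, 0.75)
```

Note that, unlike the C-statistic, these figures depend on the decision threshold chosen for the model, which is one reason published performance can differ between validation settings.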
      Successful implementation examples in cardiology have many points in common. First, they tend to come from clinical areas where large amounts of training data are available owing to the early digitisation of the information in an easily retrievable format (eg, imaging and signal processing). These applications also can be deployed into a single diagnostic modality or with the acquisition of only a few clinical features and do not rely on a complex data integration infrastructure. Next, those models are highly predictive, generally with AUC > 0.90 in external validation cohorts and clinical trials. Finally, and perhaps most importantly, most of these applications have been developed by or in collaboration with large medical device manufacturers who have decades of experience with approval, deployment, and implementation of novel technology in the clinical area, as opposed to academic researchers who are generally more focused on fundamental discoveries.

      The Path Forward: How to Improve the Rate of Successful Implementation in the Future

      From this review of ML system implementation successes and failures in cardiology, it is clear that successful applications are limited to those in which either the implementation issues are contained or simple (eg, single-modality deployment, risk scores) or the implementation considerations have been integrated into the technology development cycle (eg, wearables). Multimodality ML systems and those that involve large between-centre variation (eg, natural language) have lower rates of successful implementation.
      Based on the observations above, improving the transition rate of ML systems in cardiology from showcase projects to fully deployed applications with a direct impact on patient care will require both deep environmental changes and a new technology development framework that addresses implementation issues at the project outset. From a medical environment point of view, improving the successful transition rate will require rethinking sharing of, and access to, large-scale multicentre training data. As highlighted, the easy availability of large repositories of ECG data is one of the main reasons why many of the success stories for ML systems in cardiology are in the arrhythmia domain (Moody and Mark).
      Such resources are emerging for many cardiac imaging modalities (Dey et al.), but few publicly available resources exist for clinical data, particularly when deep phenotypic information is needed. The second major environmental change will need to occur in the creation and management of medical data. The current reliance on free text with nonsystematic terminology makes it very onerous to generate high-quality data that can be used in algorithms, and the current practice of using health care providers as data-entry clerks is not sustainable. New strategies that incorporate real-time data acquisition, systematic terminology, information modelling, and data standards into electronic health record platforms will be necessary, particularly for the successful implementation of clinical risk-prediction models that are not based on a single-modality diagnostic system (Westcott and Tcheng).
      Finally, education of health care providers on AI will be necessary so that the future end-users of ML systems fully understand both the benefits and the limitations of the technology and have reasonable expectations of ML systems (Kolachalama and Garg).
      Beyond environmental changes, there is also a need to adopt a more comprehensive framework for the development of ML systems in cardiology, and in medicine in general. Part of this framework is already emerging, with new guidelines for the reporting of clinical prediction models in the scientific literature (the TRIPOD statement; Collins et al.), detailed guidance on the proper design, building, and predeployment evaluation of AI systems in cardiology (the PRIME checklist; Sengupta et al.), and general guidance on the implementation, deployment, and regulatory approval of ML systems in clinical medicine (Wiens et al.; He et al.).
      However, the data presented in this review suggest that an even more comprehensive implementation framework might be needed for the development of AI systems that can be successfully deployed for the benefit of patients with heart disease. In this framework, the process would start with a comprehensive analysis of the opportunity and the appropriateness of AI systems to provide a suitable solution and would include the design of a technical clinical implementation plan focusing specifically on availability of source data and data integration strategy. This framework would obviously include proper algorithm design, reporting, and evaluation, the latter including a deep evaluation of model performance in an external cohort, with varying levels of missing data, and in specific patient populations that might be more prone to biased predictions. Clinical deployment should include both a full human factor evaluation and shadow testing (testing under real-life conditions but without the results being made available to clinicians or patients). Finally, the full implementation framework should include continuous monitoring of the algorithms after deployment, including performance, indication of biased behaviour (both known and unintended), impact on patient outcomes, and unintended consequences.
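The continuous postdeployment monitoring described above can be sketched as a rolling-window performance check that flags windows where the model's accuracy falls below a preset floor; the window size and threshold below are illustrative assumptions, and a real system would also stratify by patient subgroup to detect biased behaviour.

```python
# Hedged sketch of continuous postdeployment monitoring: track a model's
# accuracy over successive time windows and flag windows where it drifts
# below a preset floor. Window size and threshold are illustrative.

def monitor_drift(outcomes, window=4, floor=0.7):
    """`outcomes` is a chronological list of 1 (correct prediction) and
    0 (incorrect). Returns the indices of windows whose accuracy fell
    below `floor`, signalling that review or retraining may be needed."""
    flagged = []
    for i in range(0, len(outcomes) - window + 1, window):
        accuracy = sum(outcomes[i:i + window]) / window
        if accuracy < floor:
            flagged.append(i // window)
    return flagged

history = [1, 1, 1, 0,  1, 1, 0, 1,  0, 0, 1, 0,  0, 1, 0, 0]
print(monitor_drift(history))  # [2, 3]
```

The same loop structure extends naturally to other monitored quantities (calibration, subgroup-specific error rates, outcome impact), which is the spirit of the framework proposed in the text.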

      Conclusion

      ML in cardiology is associated with substantial promise, great interest, and a large amount of dedicated work and resources, and there have been undeniable successes that are already having an impact on patient care. However, the rate of successful implementation of ML systems in clinical practice to date has been limited, and as such, the full impact on patients remains to be realised. That being said, as more systems are successfully implemented in clinical practice and in consumer-oriented and medical-grade wearable devices (Krittanawong et al.), the impact of ML technology in cardiology will surely deepen over time. In this review, we have highlighted some of the considerations and challenges involved in building and implementing ML systems in cardiology. Because ML in cardiology has only just begun to return dividends, this will be an important step toward preventing a future decoupling between the clinical impact of ML in cardiology and the amount of resources dedicated to its development, which could lead to disappointment and ultimately disinvestment in this potentially revolutionising technology. We also highlighted that ML in cardiology will reach its full potential and positive impact on patient health only by leveraging a framework in which technology development and implementation planning occur in parallel, as opposed to implementation being a secondary consideration.

      Acknowledgements

      We thank Grace Howard and Madeline Cheshire for their contribution to the literature review. Jef van den Eynde is supported by the Belgian American Educational Foundation.

      Funding Sources

      The authors have no funding sources to declare.

      Disclosures

      Dr Kutty is a consultant for GE Healthcare. The other authors have no conflicts of interest to disclose.

      References

        • UNESCO
        The race against time for smarter development. UNESCO science report.
        • Jiang F.
        • Jiang Y.
        • Zhi H.
        • et al.
        Artificial intelligence in healthcare: past, present and future.
        Stroke Vasc Neurol. 2017; 2: 230-243
        • Miller D.D.
        Machine intelligence for management of acute coronary syndromes: neural or nervous times?.
        Can J Cardiol. 2020; 36: 470-473
        • Modine T.
        • Overtchouk P.
        Machine learning is no magic: a plea for critical appraisal during periods of hype.
        JACC Cardiovasc Interv. 2019; 12: 1339-1341
        • Keane P.A.
        • Topol E.J.
        With an eye to AI and autonomous diagnosis.
        NPJ Digit Med. 2018; 1: 40
        • Quer G.
        • Arnaout R.
        • Henne M.
        • Arnaout R.
        Machine learning and the future of cardiovascular care: JACC state-of-the-art review.
        J Am Coll Cardiol. 2021; 77: 300-313
        • Johnson K.W.
        • Torres Soto J.
        • Glicksberg B.S.
        • et al.
        Artificial intelligence in cardiology.
        J Am Coll Cardiol. 2018; 71: 2668-2679
        • Shameer K.
        • Johnson K.W.
        • Glicksberg B.S.
        • Dudley J.T.
        • Sengupta P.P.
        Machine learning in cardiovascular medicine: are we there yet?.
        Heart. 2018; 104: 1156-1164
        • Feeny A.K.
        • Chung M.K.
        • Madabhushi A.
        • et al.
        Artificial intelligence and machine learning in arrhythmias and cardiac electrophysiology.
        Circ Arrhythm Electrophysiol. 2020; 13e007952
        • Al’Aref S.J.
        • Anchouche K.
        • Singh G.
        • et al.
        Clinical applications of machine learning in cardiovascular disease and its relevance to cardiac imaging.
        Eur Heart J. 2019; 40: 1975-1986
        • Dey D.
        • Slomka P.J.
        • Leeson P.
        • et al.
        Artificial intelligence in cardiovascular imaging: JACC state-of-the-art review.
        J Am Coll Cardiol. 2019; 73: 1317-1335
        • Litjens G.
        • Ciompi F.
        • Wolterink J.M.
        • et al.
        State-of-the-art deep learning in cardiovascular image analysis.
        JACC Cardiovasc Imaging. 2019; 12: 1549-1565
        • Krittanawong C.
        • Johnson K.W.
        • Rosenson R.S.
        • et al.
        Deep learning for cardiovascular medicine: a practical primer.
        Eur Heart J. 2019; 40: 2058-2073
        • Kusunose K.
        • Haga A.
        • Abe T.
        • Sata M.
        Utilization of artificial intelligence in echocardiography.
        Circ J. 2019; 83: 1623-1629
        • Sardar P.
        • Abbott J.D.
        • Kundu A.
        • et al.
        Impact of artificial intelligence on interventional cardiology: from decision-making aid to advanced interventional procedure assistance.
        JACC Cardiovasc Interv. 2019; 12: 1293-1303
        • Siontis K.C.
        • Yao X.
        • Pirruccello J.P.
        • Philippakis A.A.
        • Noseworthy P.A.
        How will machine learning inform the clinical care of atrial fibrillation?.
        Circ Res. 2020; 127: 155-169
        • Shin S.
        • Austin P.C.
        • Ross H.J.
        • et al.
        Machine learning vs conventional statistical models for predicting heart failure readmission and mortality.
        ESC Heart Fail. 2021; 8: 106-115
        • Turing A.
        Computing machinery and intelligence.
        Mind. 1950; 59: 433-460
        • Samuel A.L.
        Some studies in machine learning using the game of checkers.
        IBM J Res Dev. 1959; 3: 210-229
        • Russell S.J.
        • Norvig P.
        Artificial Intelligence: A Modern Approach.
        4th ed. Pearson, Hoboken, 2021
        • Fjelland R.
        Why general artificial intelligence will not be realized.
        Humanit Soc Sci Commun. 2020; 7: 10
        • Matheny M.
        • Israni S.T.
        • Ahmed M.
        • Whicher D.
        Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. National Academy of Medicine, Washington, 2019
        • Katz D.H.
        • Deo R.C.
        • Aguilar F.G.
        • et al.
        Phenomapping for the identification of hypertensive patients with the myocardial substrate for heart failure with preserved ejection fraction.
        J Cardiovasc Transl Res. 2017; 10: 275-284
        • Horiuchi Y.
        • Tanimoto S.
        • Latif A.
        • et al.
        Identifying novel phenotypes of acute heart failure using cluster analysis of clinical variables.
        Int J Cardiol. 2018; 262: 57-63
        • Komorowski M.
        • Celi L.A.
        • Badawi O.
        • Gordon A.C.
        • Faisal A.A.
        The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care.
        Nat Med. 2018; 24: 1716-1720
        • Christodoulou E.
        • Ma J.
        • Collins G.S.
        • et al.
        A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.
        J Clin Epidemiol. 2019; 110: 12-22
        • Rudin C.
        Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
        Nat Mach Intell. 2019; 1: 206-215
        • Lauritsen S.M.
        • Kristensen M.
        • Olsen M.V.
        • et al.
        Explainable artificial intelligence model to predict acute critical illness from electronic health records.
        Nat Commun. 2020; 11: 3852
        • Goldberger A.L.
        • Amaral L.A.
        • Glass L.
        • et al.
        PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals.
        Circulation. 2000; 101: E215-E220
        • Leclerc S.
        • Smistad E.
        • Pedrosa J.
        • et al.
        Deep learning for segmentation using an open large-scale dataset in 2D echocardiography.
        IEEE Trans Med Imaging. 2019; 38: 2198-2210
        • Bunting K.V.
        • Steeds R.P.
        • Slater L.T.
        • et al.
        A practical guide to assess the reproducibility of echocardiographic measurements.
        J Am Soc Echocardiogr. 2019; 32: 1505-1515
        • Sudlow C.
        • Gallacher J.
        • Allen N.
        • et al.
        UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.
        PLoS Med. 2015; 12: e1001779
        • Furukawa M.F.
        • King J.
        • Patel V.
        • et al.
        Despite substantial progress in EHR adoption, health information exchange and patient engagement remain low in office settings.
        Health Aff (Millwood). 2014; 33: 1672-1679
        • Jha A.K.
        The promise of electronic records: around the corner or down the road?.
        JAMA. 2011; 306: 880-881
        • Wang Y.
        • Luo J.
        • Hao S.
        • et al.
        NLP based congestive heart failure case finding: a prospective analysis on statewide electronic medical records.
        Int J Med Inform. 2015; 84: 1039-1047
        • Meystre S.M.
        • Kim Y.
        • Gobbel G.T.
        • et al.
        Congestive heart failure information extraction framework for automated treatment performance measures assessment.
        J Am Med Inform Assoc. 2017; 24: e40-e46
        • Parikh R.B.
        • Teeple S.
        • Navathe A.S.
        Addressing bias in artificial intelligence in health care.
        JAMA. 2019; 322: 2377-2378
        • Wolff R.F.
        • Moons K.G.M.
        • Riley R.D.
        • et al.
        PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.
        Ann Intern Med. 2019; 170: 51-58
        • Zech J.R.
        • Badgeley M.A.
        • Liu M.
        • et al.
        Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study.
        PLoS Med. 2018; 15: e1002683
        • Moons K.G.
        • Altman D.G.
        • Reitsma J.B.
        • et al.
        Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration.
        Ann Intern Med. 2015; 162: W1-W73
        • Steyerberg E.W.
        • Bleeker S.E.
        • Moll H.A.
        • Grobbee D.E.
        • Moons K.G.
        Internal and external validation of predictive models: a simulation study of bias and precision in small samples.
        J Clin Epidemiol. 2003; 56: 441-447
        • Moons K.G.
        • Kengne A.P.
        • Grobbee D.E.
        • et al.
        Risk prediction models: II. External validation, model updating, and impact assessment.
        Heart. 2012; 98: 691-698
        • Wessler B.S.
        • Nelson J.
        • Park J.G.
        • et al.
        External validations of cardiovascular clinical prediction models: a large-scale review of the literature.
        Circ Cardiovasc Qual Outcomes. 2021; 14: e007858
        • Alba A.C.
        • Agoritsas T.
        • Walsh M.
        • et al.
        Discrimination and calibration of clinical prediction models: users’ guides to the medical literature.
        JAMA. 2017; 318: 1377-1384
        • Pant S.
        • Patel N.J.
        • Deshmukh A.
        • et al.
        Trends in infective endocarditis incidence, microbiology, and valve replacement in the United States from 2000 to 2011.
        J Am Coll Cardiol. 2015; 65: 2070-2076
        • Muehlematter U.J.
        • Daniore P.
        • Vokinger K.N.
        Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015-20): a comparative analysis.
        Lancet Digit Health. 2021; 3: e195-e203
        • Wessler B.S.
        • Lai Yh L.
        • Kramer W.
        • et al.
        Clinical prediction models for cardiovascular disease: Tufts Predictive Analytics and Comparative Effectiveness Clinical Prediction Model database.
        Circ Cardiovasc Qual Outcomes. 2015; 8: 368-375
        • Beam A.L.
        • Kohane I.S.
        Big data and machine learning in health care.
        JAMA. 2018; 319: 1317-1318
        • Krittanawong C.
        • Virk H.U.H.
        • Bangalore S.
        • et al.
        Machine learning prediction in cardiovascular diseases: a meta-analysis.
        Sci Rep. 2020; 10: 16057
        • Gottdiener J.S.
        • Fohner A.E.
        Risk prediction in heart failure: new methods, old problems.
        JACC Heart Fail. 2020; 8: 22-24
        • Castelvecchi D.
        Can we open the black box of AI?.
        Nature. 2016; 538: 20-23
        • US Food and Drug Administration
        Artificial intelligence/machine learning (AI/ML)–based software as a medical device (SaMD) action plan. January 2021.
        (Available at:) (Accessed October 1, 2021)
        • Wu E.
        • Wu K.
        • Daneshjou R.
        • et al.
        How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals.
        Nat Med. 2021; 27: 582-584
        • Benjamens S.
        • Dhunnoo P.
        • Mesko B.
        The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database.
        NPJ Digit Med. 2020; 3: 118
        • Brownlee S.
        • Chalkidou K.
        • Doust J.
        • et al.
        Evidence for overuse of medical services around the world.
        Lancet. 2017; 390: 156-168
        • Turakhia M.P.
        • Shafrin J.
        • Bognar K.
        • et al.
        Economic burden of undiagnosed nonvalvular atrial fibrillation in the United States.
        Am J Cardiol. 2015; 116: 733-739
        • Moss T.J.
        • Calland J.F.
        • Enfield K.B.
        • et al.
        New-onset atrial fibrillation in the critically ill.
        Crit Care Med. 2017; 45: 790-797
        • Picon A.
        • Irusta U.
        • Alvarez-Gila A.
        • et al.
        Mixed convolutional and long short-term memory network for the detection of lethal ventricular arrhythmia.
        PLoS One. 2019; 14: e0216756
        • Tison G.H.
        • Sanchez J.M.
        • Ballinger B.
        • et al.
        Passive detection of atrial fibrillation using a commercially available smartwatch.
        JAMA Cardiol. 2018; 3: 409-416
        • Wasserlauf J.
        • You C.
        • Patel R.
        • et al.
        Smartwatch performance for the detection and quantification of atrial fibrillation.
        Circ Arrhythm Electrophysiol. 2019; 12: e006834
        • Bumgarner J.M.
        • Lambert C.T.
        • Hussein A.A.
        • et al.
        Smartwatch algorithm for automated detection of atrial fibrillation.
        J Am Coll Cardiol. 2018; 71: 2381-2388
        • Comstock J.
        AliveCor launches ECG-sensing Apple Watch strap, SmartRhythm app. MobiHealthNews, November 30, 2017.
        (Available at:) (Accessed October 1, 2021)
        • Muoio D.
        AliveCor ends sales of KardiaBand, its ECG accessory for Apple Watches. MobiHealthNews, August 19, 2019.
        (Available at:) (Accessed October 1, 2021)
        • Goldenthal I.L.
        • Sciacca R.R.
        • Riga T.
        • et al.
        Recurrent atrial fibrillation/flutter detection after ablation or cardioversion using the AliveCor KardiaMobile device: iHEART results.
        J Cardiovasc Electrophysiol. 2019; 30: 2220-2228
        • Nguyen M.T.
        • Nguyen B.V.
        • Kim K.
        Deep feature learning for sudden cardiac arrest detection in automated external defibrillators.
        Sci Rep. 2018; 8: 17196
        • Acharya U.R.
        • Fujita H.
        • Oh S.L.
        • et al.
        Automated identification of shockable and nonshockable life-threatening ventricular arrhythmias using convolutional neural network.
        Future Gener Comput Syst. 2018; 79: 952-959
        • Attia Z.I.
        • Kapa S.
        • Lopez-Jimenez F.
        • et al.
        Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram.
        Nat Med. 2019; 25: 70-74
        • Yao X.
        • McCoy R.G.
        • Friedman P.A.
        • et al.
        ECG AI-Guided Screening for Low Ejection Fraction (EAGLE): rationale and design of a pragmatic cluster randomized trial.
        Am Heart J. 2020; 219: 31-36
        • Yang C.
        • Aranoff N.D.
        • Green P.
        • Tavassolian N.
        A binary classification of cardiovascular abnormality using time-frequency features of cardio-mechanical signals.
        Annu Int Conf IEEE Eng Med Biol Soc. 2018; 2018: 5438-5441
        • Siecinski S.
        • Kostka P.S.
        • Tkacz E.J.
        Heart rate variability analysis on electrocardiograms, seismocardiograms and gyrocardiograms on healthy volunteers.
        Sensors (Basel). 2020; 20: 4522
        • Yang C.
        • Ojha B.D.
        • Aranoff N.D.
        • Green P.
        • Tavassolian N.
        Classification of aortic stenosis using conventional machine learning and deep learning methods based on multi-dimensional cardio-mechanical signals.
        Sci Rep. 2020; 10: 17521
        • Knackstedt C.
        • Bekkers S.C.
        • Schummers G.
        • et al.
        Fully automated versus standard tracking of left ventricular ejection fraction and longitudinal strain: the FAST-EFs multicenter study.
        J Am Coll Cardiol. 2015; 66: 1456-1466
        • Zhang J.
        • Gajjala S.
        • Agrawal P.
        • et al.
        Fully automated echocardiogram interpretation in clinical practice.
        Circulation. 2018; 138: 1623-1635
        • Retson T.A.
        • Masutani E.M.
        • Golden D.
        • Hsiao A.
        Clinical performance and role of expert supervision of deep learning for cardiac ventricular volumetry: a validation study.
        Radiol Artif Intell. 2020; 2: e190064
        • Arterys Cardio DL cloud MRI analytics software receives FDA clearance
        Diagnostic and Interventional Cardiology, January 9, 2017.
        (Available at:) (Accessed October 1, 2021)
        • Itu L.
        • Rapaka S.
        • Passerini T.
        • et al.
        A machine-learning approach for computation of fractional flow reserve from coronary computed tomography.
        J Appl Physiol (1985). 2016; 121: 42-52
        • Eberhard M.
        • Nadarevic T.
        • Cousin A.
        • et al.
        Machine learning–based CT fractional flow reserve assessment in acute chest pain: first experience.
        Cardiovasc Diagn Ther. 2020; 10: 820-830
        • Ye C.
        • Fu T.
        • Hao S.
        • et al.
        Prediction of incident hypertension within the next year: prospective study using statewide electronic health records and machine learning.
        J Med Internet Res. 2018; 20: e22
        • Hao S.
        • Fu T.
        • Wu Q.
        • et al.
        Estimating one-year risk of incident chronic kidney disease: retrospective development and validation study using electronic medical record data from the state of Maine.
        JMIR Med Inform. 2017; 5: e21
        • Garvin J.H.
        • Kim Y.
        • Gobbel G.T.
        • et al.
        Automating quality measures for heart failure using natural language processing: a descriptive study in the Department of Veterans Affairs.
        JMIR Med Inform. 2018; 6: e5
        • Kim Y.
        • Garvin J.H.
        • Goldstein M.K.
        • et al.
        Extraction of left ventricular ejection fraction information from various types of clinical reports.
        J Biomed Inform. 2017; 67: 42-48
        • Kim Y.
        • Garvin J.
        • Goldstein M.K.
        • Meystre S.M.
        Classification of contextual use of left ventricular ejection fraction assessments.
        Stud Health Technol Inform. 2015; 216: 599-603
        • Gobbel G.T.
        • Garvin J.
        • Reeves R.
        • et al.
        Assisted annotation of medical free text using RapTAT.
        J Am Med Inform Assoc. 2014; 21: 833-841
        • Kim Y.
        • Garvin J.
        • Heavirland J.
        • Meystre S.M.
        Improving heart failure information extraction by domain adaptation.
        Stud Health Technol Inform. 2013; 192: 185-189
        • Hannun A.Y.
        • Rajpurkar P.
        • Haghpanahi M.
        • et al.
        Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network.
        Nat Med. 2019; 25: 65-69
        • McManus D.D.
        • Lee J.
        • Maitas O.
        • et al.
        A novel application for the detection of an irregular pulse using an iPhone 4S in patients with atrial fibrillation.
        Heart Rhythm. 2013; 10: 315-319
        • Moody G.B.
        • Mark R.G.
        The impact of the MIT-BIH arrhythmia database.
        IEEE Eng Med Biol Mag. 2001; 20: 45-50
        • Westcott R.J.
        • Tcheng J.E.
        Artificial intelligence and machine learning in cardiology.
        JACC Cardiovasc Interv. 2019; 12: 1312-1314
        • Kolachalama V.B.
        • Garg P.S.
        Machine learning and medical education.
        NPJ Digit Med. 2018; 1: 54
        • Collins G.S.
        • Reitsma J.B.
        • Altman D.G.
        • Moons K.G.
        • TRIPOD Group
        Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement.
        Circulation. 2015; 131: 211-219
        • Sengupta P.P.
        • Shrestha S.
        • Berthon B.
        • et al.
        Proposed Requirements for Cardiovascular Imaging–Related Machine Learning Evaluation (PRIME): a checklist: reviewed by the American College of Cardiology Healthcare Innovation Council.
        JACC Cardiovasc Imaging. 2020; 13: 2017-2035
        • Wiens J.
        • Saria S.
        • Sendak M.
        • et al.
        Do no harm: a roadmap for responsible machine learning for health care.
        Nat Med. 2019; 25: 1337-1340
        • He J.
        • Baxter S.L.
        • Xu J.
        • et al.
        The practical implementation of artificial intelligence technologies in medicine.
        Nat Med. 2019; 25: 30-36
        • Krittanawong C.
        • Rogers A.J.
        • Johnson K.W.
        • et al.
        Integration of novel monitoring devices with machine learning technology for scalable cardiovascular management.
        Nat Rev Cardiol. 2021; 18: 75-91