1. Introduction

The tracking of interventional devices is an important prerequisite for interventional specialists during catheterized cardiac interventions, such as percutaneous coronary interventions (PCIs), cardiac electrophysiology, or transarterial chemoembolization.1–3 Tracking the tip of the catheter as a visual guidance facilitates navigation to the desired anatomy. Furthermore, the tip of the catheter serves as an anchor point separating the catheter from the vessel structures. This anchor point can provide a basis for mapping angiography (high-dose X-ray with an injected contrast agent) to fluoroscopy (low-dose X-ray), thereby reducing the amount of contrast agent needed to visualize vessels.1,4 Catheter tip tracking also offers a significant cue for co-registering intravascular ultrasonography with angiography and performing a complete examination of the vessel, lumen, and wall structure.5–7 However, tracking the tip of the catheter in X-ray images can be challenging in the presence of various occlusions caused by the contrast agent and other devices, in addition to the cardiac and breathing motion of the patient.

Recently, self-supervised learning methods have been developed with the aim of learning general features from unlabeled data to boost the performance of various natural image sequence tasks. Most self-supervised pretraining methods learn such features by identifying and removing inherent redundancies from sequential image data. VideoMAE8 conducts temporal downsampling at the pixel level, followed by symmetrical masking over all of the sampled frames with a high masking ratio of 90%. This deliberate design choice prevents the network from learning fine inter-frame correspondences. SiamMAE9 improves upon this baseline using highly asymmetric masking. However, the proposed asymmetric masking requires feeding in the first frame entirely with 0% masking, which increases the computational complexity quadratically and prevents the network from learning spatio-temporal features over a longer period of time.

The space-time semantics of interventional cardiac image sequences differ from those of natural videos in terms of both redundancies and motion. For example, visibility may vary largely with the X-ray dose, and motion varies with the acquisition frame rate as well as the patient's breathing and cardiac motion. In angiography sequences, vessels have high structural similarity with devices, such as catheters and guidewires, and can gradually appear or disappear over time. To address these challenges, in this work, we make the following contributions in terms of both self-supervised pretraining and downstream device tracking.
2. Related Work

2.1. Self-Supervised Learning

These methods have been used in a variety of contexts to learn features from unlabeled data that boost the performance of downstream tasks, for example, via pretext tasks13–15 and contrastive learning.16–21 For sequential image data (e.g., video), temporal information has been leveraged in various ways.22–28 However, self-supervised methods based on masked image modeling (MIM), in which the input is masked to a high percentage and fed through an encoder-decoder network to predict the missing information, have recently shown significant promise.29–32 Some methods use symmetrical masking on temporally downsampled video frames to reduce space-time redundancies over a long time period,8,33 whereas others9 use asymmetrical masking to learn inter-frame correspondence between frame pairs. By contrast, we propose a method that both reduces space-time redundancies over a long time period and learns fine inter-frame correspondence.

2.2. Siamese Natural Image Tracking

These strategies leverage a Siamese architecture for matching between search and target templates, in which the extracted spatial search and template features are matched via feature fusion or a similar matching module.34–40 With the rise of transformers, Siamese trackers have been extended to incorporate transformer-based models, such as Stark41 and MixFormer,42 among other methods.43–45

2.3. Historical-Trajectory-Based Natural Image Tracking

These approaches leverage prompt-based methods to integrate relevant temporal information: the historical trajectory is passed into the network as prompts. ARTrack46 employs a decoder that receives these encodings as well as the coordinates of the searched object from previous frames as spatio-temporal prompts for a trajectory proposal. Another approach, SwinTrack,47 uses a multi-head cross-attention decoder that leverages both the encoder output and a motion token representing the past object trajectory given previous bounding box predictions.

2.4. Device Tracking in X-Ray

For the tracking of devices in X-ray images specifically, multiple approaches have been proposed, including several Siamese-based architectures similar to those used in natural image object tracking.34,48 Other methods, such as Cycle Ynet,10 employ a semi-supervised approach to address the lack of annotated frames in the medical domain or leverage deep learning-based Bayesian filtering for catheter tip tracking.1 One of the most recent approaches, ConTrack,11 uses a Siamese architecture and a transformer-based feature fusion model. To further refine the tracking, it incorporates a RAFT49 model applied to catheter body masks for estimating the optical flow.

3. Methods

We propose a novel FIMAE approach to train a transformer model that extracts spatio-temporal features from a large internal unlabeled dataset Du. The model is designed specifically to learn inter-frame correspondences over a large number of frames. The pretrained encoder is then used as the backbone for the downstream tracking task using supervised learning on a labeled dataset Dl (with expert annotations). The pretraining method and the tracking pipeline are explained in the following subsections.

3.1. Self-Supervised Model Training

3.1.1. Learning space-time embeddings

Given the unlabeled dataset Du, a fixed number of frames is sampled from an arbitrary sequence of the dataset.
All image frames are randomly cropped to a fixed size on a sequence level (i.e., the same crop is applied to each image). Each input frame is spatially encoded into patch tokens with no temporal downsampling.

3.1.2. Masking strategy based on frame interpolation

To learn features that capture fine spatial information and fine temporal correspondences between frames, we propose a novel masking strategy based on frame interpolation that overcomes the limitation of the symmetrical tube masking proposed by VideoMAE.8 Recall that the VideoMAE approach is limited in capturing fine inter-frame correspondences. Traditionally, in the domain of natural imaging, the frame interpolation task50,51 is defined as the sum of a forward warping of one neighboring frame and a backward warping of the other, i.e., the intermediate frame is obtained by combining a forward warping operator applied to the preceding frame and a backward warping operator applied to the following frame (both parametrized by the estimated motion). However, the change of appearance of coronary vessel structures in the presence of contrast can be much more complex than in natural images. Hence, a linear combination of forward and backward warping can limit the potential of the network. In our case, we reformulate this as a learning problem, seeking to optimize the parameters of a deep neural network to learn a combined warping operation.

In our approach, we use tube masking for every alternate frame with a ratio of 75% and combine it with frame masking. However, with such a high tube masking ratio, further masking an entire intermediate frame for frame interpolation can make the task extremely challenging. In addition, masking an entire frame may also lead the network to never attend to certain patch positions during training. Hence, we instead mask the intermediate frame randomly to a high ratio of 98%. See Fig. 3 for a schematic visualization. Let Ω_tube(t) be the token indices of the tube-masked tokens of frame t, where Ω_tube denotes the set of all tube-masked token indices. Similarly, Ω_frame(t) refers to the frame-masked token indices of frame t among all randomly frame-masked token indices Ω_frame. The tube masking ratio is 0.75, and the same tube mask is shared across time, whereas the frame masking ratio is 0.98 and the mask is drawn independently for each intermediate frame. The remaining token indices of each frame form the sets of visible tokens. Combining the tube and frame masking strategies, we obtain a reconstruction objective for any three given frames, in which the masked patches are reconstructed from the visible patches remaining after tube/frame masking; here, t denotes the index of an arbitrary frame from the sampled sequence. The three-frame objective shown in Eq. (3) can be generalized to all frames.

3.1.3. Encoder-decoder training

The unmasked patches are passed through a ViT encoder that adopts joint space-time attention; that is, each token of frame t is projected and flattened into d-dimensional query, key, and value embeddings. The joint space-time attention is computed over the query, key, and value embeddings concatenated across all sampled consecutive frames. The encoded visible patches are then concatenated with learnable masked tokens. A lightweight transformer decoder attends to the encoded patches and the masked tokens to reconstruct the initially masked patches.
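Concretely, the set of initially masked patches is determined by the combined tube/frame masking of Sec. 3.1.2. The following is a minimal sketch of how such masks could be generated; the 75% and 98% ratios and the 10-frame clip length come from the paper, whereas the 196 tokens per frame (a 14×14 grid) and the assignment of tube masking to even-indexed frames are illustrative assumptions.

```python
import torch

def fimae_masks(num_frames: int = 10, tokens_per_frame: int = 196,
                tube_ratio: float = 0.75, frame_ratio: float = 0.98,
                generator: torch.Generator | None = None) -> torch.Tensor:
    """Generate a [num_frames, tokens_per_frame] boolean mask (True = masked).

    Alternate frames share a single tube mask (same spatial positions over
    time, 75% of tokens); each intermediate frame gets an independent random
    mask covering 98% of its tokens.
    """
    g = generator or torch.Generator().manual_seed(0)
    masks = torch.zeros(num_frames, tokens_per_frame, dtype=torch.bool)

    # One tube mask shared by all tube-masked frames.
    n_tube = int(tube_ratio * tokens_per_frame)
    tube_idx = torch.randperm(tokens_per_frame, generator=g)[:n_tube]

    n_frame = int(frame_ratio * tokens_per_frame)
    for t in range(num_frames):
        if t % 2 == 0:  # assumption: even-indexed frames use tube masking
            masks[t, tube_idx] = True
        else:           # intermediate frames: independent random masking
            frame_idx = torch.randperm(tokens_per_frame, generator=g)[:n_frame]
            masks[t, frame_idx] = True
    return masks

masks = fimae_masks()
print(masks.float().mean(dim=1))  # ~0.75 for tube-masked frames, ~0.98 for the rest
```

Keeping the tube mask fixed across its frames removes the temporal shortcut of copying co-located patches, while the 98% (rather than 100%) intermediate-frame masking leaves a few visible anchors so that every patch position is still attended to during training.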
The decoder incorporates additional positional encoding to ensure the correct positions of the masked and unmasked patches as per the original frames.

3.1.4. Pretraining loss function

We use a weighted mean squared error loss between the masked tokens of the input frames and the corresponding reconstructed tokens in pixel space, with a weighting factor determined by the masking strategy. We use a weighted loss for reconstruction to compensate for the imbalance between frames with low masking (fewer reconstruction tokens) and frames with high masking (more reconstruction tokens); the weighting factor is defined as the ratio between the numbers of masked tokens of the two frame types.

3.2. Downstream Application: Device Tracking

For tracking the tip of the catheter in particular, our goal is to track its location at any time, given a sequence of X-ray images and a known initial location of the catheter tip, on the labeled dataset Dl. We consider the sequences to have only a few annotated labels. To identify the location of the tip of the catheter in the current search frame, existing approaches build a correlation with a template frame. The template frame is usually a small crop around the catheter tip location from a previously predicted frame. Similar to ConTrack, during training, we use three template frames that are cropped from the first annotated frame and the previous two annotated frames, respectively. We use the current frame for the template frames if no previously annotated frames are available. During inference, the initial location of the catheter tip serves as the first template crop and is kept intact. The remaining two template frames are updated dynamically based on the model's predictions.

3.2.1. Feature transfer

The spatio-temporal transformer backbone inputs three template frames and a search frame as four distinct frames. We interpolate the positional encoding from the pretraining frame positions appropriately to ensure that the network distinguishes each template and the search frame as distinct frames. In particular, each template frame and the search frame correspond to the positions of the center crops of individual frames in the pretraining setup. Therefore, the encoder inputs the concatenation of the template patches and the search patches. Given that transformers are isotropic models, we obtain an encoded feature set of the same layout. The spatio-temporal transformer backbone is trained to extract fine inter-frame correspondences; hence, this results in a joint feature extraction and feature matching between the template frames and the search frame. An overview of the proposed model is depicted in Fig. 4.

3.2.2. Multi-task transformer decoder

We use a lightweight transformer decoder similar to the original transformer model.52 First, all of the features are projected to a lower dimension. The decoder uses two learnable query tokens, one for a heatmap head and one for a mask head. Each decoder layer first computes attention on the query tokens as per Eq. (4), followed by cross-attention with the encoded features, where the key and value embeddings are computed by projecting the features to the decoder dimension. The resulting query tokens are then correlated with the search features, unflattened, and passed through a convolutional neural network (CNN) head, yielding the predicted catheter tip heatmap and catheter mask. The final tip coordinates are obtained from the peak of the predicted catheter tip heatmap.
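The exact coordinate-extraction operator is not reproduced here; assuming the tip is taken at the argmax of the predicted heatmap (a common choice for heatmap-based localization), a minimal sketch is:

```python
import torch

def tip_from_heatmap(heatmap: torch.Tensor) -> tuple[int, int]:
    """Return the (row, col) catheter tip estimate as the argmax of a
    single-channel predicted heatmap of shape [H, W]."""
    flat_idx = int(torch.argmax(heatmap))
    return divmod(flat_idx, heatmap.shape[-1])

# Toy example: a heatmap peaking at (120, 87)
pred = torch.zeros(256, 256)
pred[120, 87] = 1.0
print(tip_from_heatmap(pred))  # (120, 87)
```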
We compute a soft Dice loss for both the heatmap and the mask predictions against the ground truth labels, with a weighting factor applied to the mask loss.

4. Experiments and Results

4.1. Dataset

An unlabeled internal dataset Du of coronary X-ray sequences is utilized to pretrain our model. Du consists of 241,362 sequences collected from 21,589 patients, comprising 16,342,992 frames in total. It contains both fluoroscopy ("Fluoro") and angiography ("Angio") sequences. We randomly sample 10 frames at a time, with varying temporal gaps between them, ranging from 1 to 4 frames. We repeat the last frame in sequences in which the number of frames is less than 10. The model is then pretrained for 200 epochs. For the downstream tracking task, we use the labeled dataset Dl. The distribution of the field of view for both Du and Dl is depicted in Fig. 5 and is estimated based on the positioner angles. The positioner primary angle is defined in the transaxial plane at the imaging device's isocenter, with zero degrees in the direction perpendicular to the patient's chest, +90 deg at the patient's left side, and −90 deg at the patient's right side. The positioner secondary angle is defined in the sagittal plane at the imaging device's isocenter, with zero degrees in the direction perpendicular to the patient's chest. Figure 5 shows that the distributions of the sequences in both datasets are concentrated around similar positioner angles. Other attributes of both datasets Du and Dl are listed in Table 1.

Table 1 Dataset statistics (range and median) for the unlabeled dataset (Du) and the catheter tip dataset (Dl).
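As a rough illustration of the clip sampling described above (10 frames, temporal gaps of 1 to 4 frames, last frame repeated for short sequences), the sketch below assumes a single gap drawn per clip and a uniformly sampled start index; both are assumptions rather than details given in the paper.

```python
import random

def sample_frame_indices(seq_len: int, num_frames: int = 10,
                         min_gap: int = 1, max_gap: int = 4,
                         rng: random.Random | None = None) -> list[int]:
    """Sample `num_frames` frame indices with a random temporal gap; if the
    sequence is shorter than needed, the last frame index is repeated."""
    rng = rng or random.Random(0)
    gap = rng.randint(min_gap, max_gap)          # assumption: one gap per clip
    start_max = max(seq_len - 1 - gap * (num_frames - 1), 0)
    start = rng.randint(0, start_max)            # assumption: uniform start
    return [min(start + i * gap, seq_len - 1) for i in range(num_frames)]

print(sample_frame_indices(seq_len=6))   # short sequence: last index repeats
print(sample_frame_indices(seq_len=80))  # long sequence: stride of 1 to 4 frames
```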
The annotations on the frames in Dl represent the coordinates of the tip of the catheter, which are converted to Gaussian heatmaps with a fixed standard deviation. Mask annotations of the catheter body are also available for a subset of these annotated frames. On average, the catheter body takes up 0.009% of the total area of a frame. The training and validation set consists of 2314 sequences totaling 198,993 frames, of which 44,957 are annotated. In this set, 2098 sequences are Angio and only 216 sequences are Fluoro. The test set consists of 219 sequences, in which all 17,988 frames are annotated. For evaluation, we split the test set into three categories: 94 Fluoro sequences (8494 frames and 82 patients), 101 Angio sequences (6904 frames and 81 patients), and 24 devices sequences (2593 frames and 10 patients).11 The latter category, "devices," covers all sequences in which sternal wires are present; these cause occlusion and thus further increase the difficulty of catheter tip tracking. Examples of these cases are illustrated in Fig. 6. The signal-to-noise ratio (SNR) of the image intensity at the catheter tip with respect to the background is shown in Table 2, further quantifying the challenge of tracking. The SNR was calculated as the ratio of the mean intensity in a window centered at the catheter tip to the standard deviation of the background intensity in a second window, with the catheter tip as the center of both windows.

Table 2 SNR of the different categories in the catheter tip dataset (Dl).
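A minimal sketch of this SNR computation follows; the window sizes are illustrative assumptions (the actual sizes are not reproduced here), and the helper name is ours.

```python
import numpy as np

def tip_snr(frame: np.ndarray, tip_rc: tuple[int, int],
            fg_win: int = 5, bg_win: int = 21) -> float:
    """SNR at the catheter tip: mean intensity in a small window centered on
    the tip divided by the standard deviation of a larger background window
    with the same center (window sizes are illustrative assumptions)."""
    r, c = tip_rc

    def window(size: int) -> np.ndarray:
        h = size // 2
        return frame[max(r - h, 0): r + h + 1, max(c - h, 0): c + h + 1]

    mu_fg = window(fg_win).mean()
    sigma_bg = window(bg_win).std()
    return float(mu_fg / (sigma_bg + 1e-8))

# Example on a synthetic frame with a dark tip on a noisy background
rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.05, size=(256, 256))
img[100:103, 100:103] = 0.1
print(round(tip_snr(img, (101, 101)), 2))
```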
We follow the same image pre-processing pipeline as ConTrack, i.e., we resample and pad each frame to a fixed size with 0.308 mm isotropic pixel spacing. We use fixed-size crops for the search image and smaller crops for the template images. We train our model for 100 epochs using the AdamW optimizer and a cosine annealing scheduler with warm restarts.

4.2. Performance Evaluation

We evaluate our work against state-of-the-art methods, explore the impact of the proposed pretraining strategy, and investigate whether complex additional tracking refinement modules are necessary. All of the evaluations are performed based on expert annotations.

4.2.1. Benchmarking against the state of the art

We report the performance of our model against state-of-the-art device tracking models in Table 3. Here, we evaluate the Euclidean distance error in mm between the prediction and the ground truth annotations. Overall, our method demonstrates the best performance on the test dataset, excelling in both precision and robustness. Our approach significantly reduces the overall maximum error, e.g., by 66.31% against the comparable version of ConTrack (ConTrack-mtmt) and by 23.20% against ConTrack-optim, a highly optimized solution leveraging multi-stage feature fusion, multi-task learning, and flow regularization. In comparison with previous state-of-the-art approaches, our approach results in fewer failures, as depicted by the error distribution in Fig. 7. At least 95% of all test cases have an error below the average vessel diameter. Notably, our approach stands out from other tracking models by eliminating the need for a two-stage process involving the extraction of spatial features and subsequent matching using feature fusion; instead, our spatio-temporal encoder jointly performs both.

Table 3 Comparison of sequence-level tracking errors (mean Euclidean distance) and runtime for different catheter tip tracking methods in coronary X-ray sequences. The best numbers are marked in bold. We also show the performance of different versions of ConTrack. ConTrack-base refers to its base version, which has no additional modules; ConTrack-mtmt refers to the multi-task and multi-template version; and ConTrack-optim is its final optimal version, which has all modules, including flow refinement.
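For reference, the per-frame Euclidean error in mm can be computed from pixel coordinates using the 0.308 mm isotropic pixel spacing from the pre-processing step; the helper below is a sketch with a name of our choosing.

```python
import numpy as np

def tracking_error_mm(pred_px: np.ndarray, gt_px: np.ndarray,
                      pixel_spacing_mm: float = 0.308) -> np.ndarray:
    """Per-frame Euclidean distance between predicted and ground-truth tip
    coordinates (in pixels), converted to millimeters via the isotropic
    pixel spacing used during pre-processing."""
    pred_px = np.asarray(pred_px, dtype=float)
    gt_px = np.asarray(gt_px, dtype=float)
    return np.linalg.norm(pred_px - gt_px, axis=-1) * pixel_spacing_mm

pred = np.array([[120, 87], [122, 90]])
gt = np.array([[121, 88], [119, 86]])
print(tracking_error_mm(pred, gt).mean())  # sequence-level mean error in mm
```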
Other approaches often require two or more forward passes for two-stage processing to accommodate the differing template and search sizes, which increases computational complexity. This is further amplified by the inclusion of additional modules, such as multi-task decoders and the flow-refinement network in ConTrack-optim.11 By contrast, our model accomplishes the task with a single forward pass for both the multiple templates and the search frame. The only additional modules in our model are the two CNN heads for multi-task decoding. This design choice enables us to achieve a significantly higher real-time inference speed of 42 fps on a single Tesla V100 GPU without compromising accuracy, as shown in Fig. 1. Although Cycle Ynet10 also relies on multiple forward passes for feature extraction, its simplicity and computationally friendly CNN architecture allow it to reach a higher speed, albeit at the expense of accuracy and robustness.

4.2.2. Impact of pretraining

Next, we focused on the impact of pretraining by comparing tracking performance using our proposed pretraining strategy (FIMAE) against currently prevalent pretraining methods for sequential image processing; see Table 4. The findings indicate that pretraining on domain-specific data, as opposed to natural images (VideoMAE-Kinetics), offers significant advantages. However, even when the models trained on Du (VideoMAE and SiamMAE) are included in the comparison, our model surpasses all of them by more than 30% across all reported metrics. VideoMAE lacks fine temporal correspondence between frames, leading to inefficient feature matching between the template and search frames. Although SiamMAE has the ability to learn inter-frame correspondence, it relies on only two frames at a time, which is insufficient for fully capturing the underlying motion. Qualitative results, shown in Fig. 8, are based on a challenging angiography sequence with contrast-based device obstruction and visible sternal wires. The figure shows how our model is able to handle this challenging case without losing track of the tip of the catheter, whereas the other models fail to differentiate the catheter from the sternal wires.

Table 4 Effect of pretraining strategies on the performance of catheter tip tracking. Pretraining is performed either on our internal dataset (denoted as Du) or on natural images (in the case of the first approach). The best values are marked in bold.
4.2.3. Performance without complexity

The strength of our approach comes from the pretrained spatio-temporal features that facilitate effective feature matching between the template frames and the search frame. Another key advantage is its prior understanding of the inherent cardiac/respiratory motion. This knowledge significantly reduces or even eliminates the impact of additional modules, such as flow refinement. Our approach thereby achieves high robustness in tracking, with minimal variation across different additional modules, such as multi-task decoding. To illustrate this, Fig. 9(a) highlights the relative stability of the maximum error across different versions of our model compared with the high volatility observed in ConTrack under different module configurations. In addition, ConTrack reaches its best performance only when utilizing all modules, in particular flow refinement, which in turn increases inference time. Contrary to ConTrack, adding the flow refinement module to our model even reduced its performance marginally in terms of accuracy (1.54 mm) and robustness (maximum error of 11.38 mm). We postulate that this is because, although flow refinement can indeed learn intricate temporal correspondences between the previous and current frames, it can also propagate noise originating from inaccurately predicted catheter masks.

To further assess the robustness of the tracking systems, we introduce the tracking success score (TSUC), which draws parallels with most tracking benchmarks prevalent in single object tracking in the natural image domain.53 TSUC is computed as the ratio of the number of instances (frames or sequences) in which the distance error falls below a specific threshold to the total number of instances. To establish a relevant threshold, we set it at twice the average vessel diameter in our test dataset. Figures 9(b) and 9(c) summarize the results for sequence-level and frame-level TSUC, respectively. Our approach consistently achieves an impressive 99.08% sequence-level TSUC across all additional modules, with only a small drop to 98.61% in the multi-task configuration. At the frame level, our optimal version (multi-task multi-template) yields a TSUC of 97.95%, compared with 93.53% for ConTrack under the same configuration. ConTrack achieves its best frame-level TSUC of 95.44% using the flow-refinement variant. The robustness of a method is also influenced by its ability to effectively handle long sequences, as the accuracy of the current frame prediction depends on previous frame predictions, resulting in a gradual accumulation of errors over time. We examine the mean TSUC for sequences exceeding a certain frame count (nframes) in Fig. 10. The plot shows that our method consistently demonstrates stable TSUC values across various sequence lengths, indicating its robust performance. Conversely, different versions of ConTrack exhibit a gradual decline in mean TSUC as the frame count threshold increases, suggesting reduced reliability over extended sequences.

4.2.4. Performance breakdown for different cases

We further conduct a detailed comparison with the best-performing state-of-the-art method, ConTrack, for the different image categories defined earlier; see Fig. 11. We also compare our model's performance with ConTrack for the challenging cases, i.e., angiography and devices, via percentile plots in Fig. 12.
In the angiography cases, our method shows a 15% improvement in accuracy and a 45% reduction in the maximum error. Similarly, for the devices (occlusion) category, we achieve 43% better accuracy and a 60% reduction in the maximum error (Figs. 11 and 12). Our model's performance on Angio and devices cases is compared qualitatively with ConTrack in Fig. 13. The example cases in the figure show the effectiveness of our approach in the presence of complex occlusions from the vessels and sternal wires. ConTrack performs better than our method in Fluoro cases, with a slightly better median and a lower maximum error. However, for Fluoro, ConTrack achieves a TSUC of 99.01% (inaccurate in one sequence) compared with our model's TSUC of 97.69% (inaccurate in three sequences). The inaccuracy of our model is seen in sequences in which the visibility of the catheter is faint due to low-dose X-rays. We hypothesize that this is due to the transformer architecture using non-overlapping patches, which makes it less effective for faintly visible devices in low-dose X-ray than the CNNs in ConTrack, which use overlapping windows.

4.3. Ablations

The following ablation studies investigate the impact of key components on the overall tracking performance.

4.3.1. Positional encoding

As reported in Table 5, the positional encoding strategy has a notable impact on the downstream task performance. The naive positional encoding simply applies sine-cosine positional encoding over all patches and hence loses the temporal information about the patches, resulting in unsatisfactory results. If learnable positional encoding is used, the temporal positions still need to be learned, leading to sub-optimal performance. Interpolating from the central patch positions of the pretrained frames (frame-aware positional encoding) gives the best results.

Table 5 Effect of different positional encodings incorporated in the downstream task. The best values are marked in bold.
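A minimal sketch of the frame-aware variant follows: each downstream frame (three templates plus the search frame) reuses the positional embeddings of the central patch positions of a distinct pretraining frame. The grid sizes and the use of plain center-cropping (rather than interpolation between differing grid resolutions, as the full method would require) are assumptions made for illustration.

```python
import torch

def frame_aware_pos_embed(pretrain_pos: torch.Tensor, grid_hw: tuple[int, int],
                          crop_hw: tuple[int, int], frame_ids: list[int]) -> torch.Tensor:
    """Frame-aware positional encoding sketch.

    pretrain_pos: [T, H*W, D] positional embeddings from pretraining.
    grid_hw:      pretraining patch grid (H, W).
    crop_hw:      downstream crop grid (h, w); taken from the center of each frame.
    frame_ids:    which pretraining frame each downstream frame maps to.
    """
    H, W = grid_hw
    h, w = crop_hw
    top, left = (H - h) // 2, (W - w) // 2
    pos = pretrain_pos.view(-1, H, W, pretrain_pos.shape[-1])
    crops = [pos[t, top:top + h, left:left + w].reshape(h * w, -1) for t in frame_ids]
    return torch.cat(crops, dim=0)  # [len(frame_ids) * h * w, D]

# Example with assumed sizes: 10 pretraining frames on a 14x14 grid, D = 768;
# three templates and one search frame mapped to 8x8 central crops of frames 0-3.
pe = torch.randn(10, 14 * 14, 768)
print(frame_aware_pos_embed(pe, (14, 14), (8, 8), [0, 1, 2, 3]).shape)
```

Because each downstream frame inherits a different temporal position, the encoder can still tell the templates and the search frame apart, which is what the naive and learnable variants in Table 5 lose or must relearn.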
4.3.2. Masking ratio

We further compare the performance obtained with different intermediate frame masking ratios in Table 6. The best results are obtained with an intermediate frame masking ratio of 98%. Although the results with 95% are largely equivalent, there is a notable reduction in performance when the entire frame is masked, which may be due to the absence of visible patches and their relative position information during pretraining.

Table 6 Tracking performance of FIMAE trained with different intermediate frame masking ratios, i.e., the masking ratio of Ωframe. The best values are marked in bold.
4.3.3. Effect of initialization

Recall that the first template crop during both training and inference was obtained from the initial catheter tip location and was not updated. We explore its impact in Table 7. To assess its importance, we conduct two experiments. First, we dynamically update the initial template frame during inference, as with the others. Second, we introduce random noise (2 to 16 pixels) to the initial tip location instead of updating the template. Our findings highlight the crucial role of initialization in tracking. Updating the initial template frame worsens performance due to greater accumulated prediction errors over time compared with the original setup. Additionally, even a small noise level of 2 pixels can noticeably affect performance, increasing the maximum error by 5 pixels.

Table 7 Significance of initialization in catheter tip tracking: how the performance is affected if the first template frame is updated or noise is introduced to the initial tip coordinates. The best values are marked in bold.
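The noise-injection protocol can be sketched as follows; only the 2 to 16 pixel magnitudes come from the paper, whereas the uniformly random perturbation direction and the helper itself are our assumptions for illustration.

```python
import numpy as np

def perturb_initial_tip(tip_rc: tuple[float, float], magnitude_px: float,
                        rng: np.random.Generator | None = None) -> tuple[float, float]:
    """Add pixel noise of a given magnitude to the initial tip location,
    mimicking the initialization-robustness experiment (2 to 16 px).
    The direction is drawn uniformly; this is an illustrative protocol."""
    rng = rng or np.random.default_rng(0)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    dr, dc = magnitude_px * np.sin(angle), magnitude_px * np.cos(angle)
    return tip_rc[0] + dr, tip_rc[1] + dc

for mag in (2, 4, 8, 16):
    print(mag, perturb_initial_tip((120.0, 87.0), mag))
```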
4.3.4. Modality bias

Angio and Fluoro differ to some degree in dosage and in the presence of contrast-enhanced vessel structures. We remind the reader that, in our training dataset, the distribution of Angio to Fluoro sequences was 2098:216 out of 2314 sequences in total. Our objective in this study is to develop a model that exhibits strong performance across both modalities. We present the results of training on the individual modalities compared with training on the combined data in Table 8. Our findings indicate that training solely on one modality results in suboptimal performance on the other modality. Notably, although training on Angio data yields an improvement in Angio performance, training exclusively on Fluoro data fails to enhance performance on Fluoro. We hypothesize that a possible reason for this effect is the 2098:216 imbalance between Angio and Fluoro sequences.
Table 8 Performance variation across modalities based on modality-specific training. The best values are marked in bold.
Furthermore, the challenges posed by device obstruction exhibit nuanced differences between Fluoro and Angio, contributing to reduced performance when the model is trained on a single modality.

5. Conclusion

In this study, we presented FIMAE, an MIM approach introduced for the purpose of acquiring generalized features from a large unlabeled dataset containing more than 16 million interventional X-ray frames, with the objective of device tracking. FIMAE overcomes the limitation of tube masking as proposed in VideoMAE and applies frame interpolation-based masking to capture fine inter-frame correspondences. The acquired features are subsequently applied to the task of device tracking within fluoroscopy and angiography image sequences. Our pretrained FIMAE encoder surpassed all prevalent MIM-based pretraining methods for sequential image processing. The spatio-temporal features acquired during the pretraining phase significantly influenced the extraction and matching of features for the purpose of device tracking. We demonstrated that an efficient spatio-temporal encoder can replace the frequently utilized Siamese-like architecture, yielding a computationally lightweight model that maintains a high degree of precision and robustness in the tracking task. By adopting our methodology, we achieved a noteworthy 23.2% reduction in the maximum tracking error, even without the incorporation of supplementary modules such as flow refinement, when compared with the state-of-the-art multi-modular optimized approach. This performance enhancement was accompanied by a frame-level TSUC score of 97.95% at a faster inference speed than the state-of-the-art method. The results also show that our approach achieved superior tracking performance, particularly in the challenging cases in which occlusions and distractors are present.

5.1. Limitations and Future Work

Our investigation is primarily centered on leveraging pretrained features for the tracking of devices within X-ray sequences. Nonetheless, we contend that the pretrained model can be further extended to other tasks within interventional image analytics, such as stenosis detection, guidewire localization, and vessel segmentation. Furthermore, the absence of annotated frames within our sequential imaging dataset imposes a constraint on the utilization of historical trajectory information, a commonly exploited approach in recent single object tracking methodologies in the natural imaging domain. Thus, a more comprehensive investigation is needed to effectively make use of this information in our specific context.

6. Appendix A: Pretraining Details

The detailed architecture and the implementation details of the pretraining are provided in Tables 9 and 10, respectively. We use a 10-frame vanilla ViT-Base as our encoder architecture; it applies joint space-time attention to the visible patches. The decoder has a lower dimension and depth than the encoder and applies similar joint space-time attention to all patches. The decoder is only responsible for reconstruction and is discarded for downstream tasks.

Table 9 Architecture details of FIMAE. We use a 10-frame vanilla ViT-Base as our architecture. "MHA" here denotes the joint space-time self-attention. The output sizes are denoted by C×T×S for the channel, temporal, and spatial sizes, respectively.
Table 10 Pretraining setting.
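To summarize the pretraining pipeline of Appendix A, the following is a simplified sketch of the asymmetric encoder-decoder: joint space-time self-attention over the visible tokens only in the encoder, learnable mask tokens plus positional encoding in the decoder, and pixel reconstruction of the masked patches. The encoder width/depth follow standard ViT-Base; the decoder size, patch dimension, token counts, and the demo mask are illustrative assumptions, and the reconstruction-loss weighting described in Sec. 3.1.4 is omitted.

```python
import torch
import torch.nn as nn

class MaskedReconstructor(nn.Module):
    """Simplified FIMAE-style asymmetric encoder-decoder (a sketch, not the
    exact configuration): the encoder runs joint space-time self-attention on
    visible tokens only; the decoder sees encoded tokens plus learnable mask
    tokens, adds its own positional encoding, and regresses masked pixels."""

    def __init__(self, patch_dim=256, enc_dim=768, dec_dim=384,
                 enc_depth=12, dec_depth=4, num_tokens=10 * 196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, enc_dim))
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_tokens, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=6, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_depth)
        self.head = nn.Linear(dec_dim, patch_dim)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: [B, N, patch_dim] flattened space-time tokens; mask: [B, N] bool (True = masked).
        # Assumes every sample has the same number of visible tokens (fixed masking ratios).
        B, N, _ = patches.shape
        x = self.embed(patches) + self.pos[:, :N]
        vis = x[~mask].view(B, -1, x.shape[-1])          # encoder sees visible tokens only
        enc = self.enc_to_dec(self.encoder(vis))
        full = self.mask_token.expand(B, N, -1).clone()  # fill every slot with the mask token
        full[~mask] = enc.reshape(-1, enc.shape[-1])     # put encoded tokens back in place
        dec = self.decoder(full + self.dec_pos[:, :N])   # decoder attends to all positions
        return self.head(dec)                            # per-token pixel reconstruction

# Tiny demo with shallow depths; realistic sizes follow ViT-Base for the encoder.
model = MaskedReconstructor(enc_depth=2, dec_depth=1)
patches = torch.randn(2, 10 * 196, 256)
mask = (torch.rand(1, 10 * 196) > 0.25).expand(2, -1)   # ~75% masked, shared across the batch
recon = model(patches, mask)
loss = ((recon - patches)[mask] ** 2).mean()            # MSE on masked tokens (weighting omitted)
print(recon.shape, float(loss))
```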
7. Appendix B: Downstream Model Details

The architectural details of the downstream tracking model are given in Table 11. The encoder is the same as the pretraining encoder, whereas the decoder is a lightweight transformer decoder followed by two CNN heads that output the catheter tip heatmap and the catheter body mask, respectively. The implementation details are further described in Table 12.

Table 11 Architecture details of the downstream tracking model. "CA" refers to cross-attention.
Table 12 Finetuning setting.
Code and Data Availability

Based on the data usage agreements, the data cannot be shared with the community. More information about the code can be shared upon request.

Disclaimer

The concepts and information presented in this paper are based on research results that are not commercially available.

References
1. H. Ma et al., "Dynamic coronary roadmapping via catheter tip tracking in X-ray fluoroscopy with deep learning based Bayesian filtering," Med. Image Anal. 61, 101634 (2020). https://doi.org/10.1016/j.media.2020.101634
2. K. E. Odening et al., "ESC working group on cardiac cellular electrophysiology position paper: relevance, opportunities, and limitations of experimental models for cardiac electrophysiology research," EP Europace 23(11), 1795–1814 (2021). https://doi.org/10.1093/europace/euab142
3. A. Facciorusso et al., "Transarterial chemoembolization: evidences from the literature and applications in hepatocellular carcinoma patients," World J. Hepatol. 7(16), 2009 (2015). https://doi.org/10.4254/wjh.v7.i16.2009
4. K. Piayda et al., "Dynamic coronary roadmapping during percutaneous coronary intervention: a feasibility study," Eur. J. Med. Res. 23, 1–7 (2018). https://doi.org/10.1186/s40001-018-0333-x
5. P. Wang et al., "Image-based device tracking for the co-registration of angiography and intravascular ultrasound images," Lect. Notes Comput. Sci. 6891, 161–168 (2011). https://doi.org/10.1007/978-3-642-23623-5_21
6. T. Araki et al., "A comparative approach of four different image registration techniques for quantitative assessment of coronary artery calcium lesions using intravascular ultrasound," Comput. Methods Programs Biomed. 118(2), 158–172 (2015). https://doi.org/10.1016/j.cmpb.2014.11.006
7. P. Wang et al., "Image-based co-registration of angiography and intravascular ultrasound images," IEEE Trans. Med. Imaging 32(12), 2238–2249 (2013). https://doi.org/10.1109/TMI.2013.2279754
8. Z. Tong et al., "VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training," in Adv. in Neural Inf. Process. Syst., 10078–10093 (2022).
9. A. Gupta et al., "Siamese masked autoencoders," (2023).
10. J. Lin et al., "Cycle Ynet: semi-supervised tracking of 3D anatomical landmarks," Lect. Notes Comput. Sci. 12436, 593–602 (2020). https://doi.org/10.1007/978-3-030-59861-7_60
11. M. Demoustier et al., "ConTrack: contextual transformer for device tracking in X-ray," (2023).
12. A. Dosovitskiy et al., "An image is worth 16x16 words: transformers for image recognition at scale," (2020).
13. C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proc. IEEE Int. Conf. Comput. Vis., 1422–1430 (2015). https://doi.org/10.1109/ICCV.2015.167
14. D. Pathak et al., "Learning features by watching objects move," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 2701–2710 (2017). https://doi.org/10.1109/CVPR.2017.638
15. S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," (2018).
16. Z. Wu et al., "Unsupervised feature learning via non-parametric instance discrimination," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 3733–3742 (2018). https://doi.org/10.1109/CVPR.2018.00393
17. K. He et al., "Momentum contrast for unsupervised visual representation learning," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 9729–9738 (2020). https://doi.org/10.1109/CVPR42600.2020.00975
18. M. Caron et al., "Emerging properties in self-supervised vision transformers," in IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 9630–9640 (2021). https://doi.org/10.1109/ICCV48922.2021.00951
19. T. Chen et al., "A simple framework for contrastive learning of visual representations," in Int. Conf. Mach. Learn., 1597–1607 (2020).
20. J.-B. Grill et al., "Bootstrap your own latent: a new approach to self-supervised learning," in Adv. in Neural Inf. Process. Syst., 21271–21284 (2020).
21. X. Chen and K. He, "Exploring simple Siamese representation learning," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 15750–15758 (2021). https://doi.org/10.1109/CVPR46437.2021.01549
22. P. Sermanet et al., "Time-contrastive networks: self-supervised learning from video," in IEEE Int. Conf. Rob. and Autom. (ICRA), 1134–1141 (2018). https://doi.org/10.1109/CVPRW.2017.69
23. C. Sun et al., "Learning video representations using contrastive bidirectional transformer," (2019).
24. T. Han, W. Xie, and A. Zisserman, "Video representation learning by dense predictive coding," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (2019). https://doi.org/10.1109/ICCVW.2019.00186
25. C. Feichtenhofer et al., "A large-scale study on unsupervised spatiotemporal representation learning," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 3299–3309 (2021). https://doi.org/10.1109/CVPR46437.2021.00331
26. A. Recasens et al., "Broaden your views for self-supervised video learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 1255–1265 (2021). https://doi.org/10.1109/ICCV48922.2021.00129
27. R. Qian et al., "Spatiotemporal contrastive video representation learning," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 6964–6974 (2021). https://doi.org/10.1109/CVPR46437.2021.00689
28. N. Park et al., "What do self-supervised vision transformers learn?," (2023).
29. J. Devlin et al., "BERT: pre-training of deep bidirectional transformers for language understanding," (2018).
30. K. He et al., "Masked autoencoders are scalable vision learners," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 16000–16009 (2022). https://doi.org/10.1109/CVPR52688.2022.01553
31. H. Bao et al., "BEiT: BERT pre-training of image transformers," (2021).
32. Z. Xie et al., "SimMIM: a simple framework for masked image modeling," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 9653–9663 (2022). https://doi.org/10.1109/CVPR52688.2022.00943
33. C. Feichtenhofer et al., "Masked autoencoders as spatiotemporal learners," in Adv. in Neural Inf. Process. Syst., 35946–35958 (2022).
34. B. Li et al., "High performance visual tracking with Siamese region proposal network," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit., 8971–8980 (2018). https://doi.org/10.1109/CVPR.2018.00935
35. B. Li et al., "Evolution of Siamese visual tracking with very deep networks," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 15–20 (2019). https://doi.org/10.1109/CVPR.2019.00441
36. H. Fan and H. Ling, "Siamese cascaded region proposal networks for real-time visual tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 7952–7961 (2019). https://doi.org/10.1109/CVPR.2019.00814
37. Z. Zhu et al., "Distractor-aware Siamese networks for visual object tracking," Lect. Notes Comput. Sci. 11213, 101–117 (2018). https://doi.org/10.1007/978-3-030-01240-3_7
38. H. Fan and H. Ling, "CRACT: cascaded regression-align-classification for robust tracking," in IEEE/RSJ Int. Conf. Intell. Rob. and Syst. (IROS), 7013–7020 (2021). https://doi.org/10.1109/IROS51168.2021.9636803
39. Y. Yu et al., "Deformable Siamese attention networks for visual object tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 6728–6737 (2020). https://doi.org/10.1109/CVPR42600.2020.00676
40. Z. Zhang et al., "Learn to match: automatic matching network design for visual tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 13339–13348 (2021). https://doi.org/10.1109/ICCV48922.2021.01309
41. B. Yan et al., "Learning spatio-temporal transformer for visual tracking," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 10448–10457 (2021). https://doi.org/10.1109/ICCV48922.2021.01028
42. Y. Cui et al., "MixFormer: end-to-end tracking with iterative mixed attention," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 13608–13618 (2022).
43. J. Kugarajeevan et al., "Transformers in single object tracking: an experimental survey," IEEE Access 11, 80297–80326 (2023). https://doi.org/10.1109/ACCESS.2023.3298440
44. N. Wang et al., "Transformer meets tracker: exploiting temporal context for robust visual tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 1571–1580 (2021). https://doi.org/10.1109/CVPR46437.2021.00162
45. X. Chen et al., "Transformer tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 8126–8135 (2021). https://doi.org/10.1109/CVPR46437.2021.00803
46. X. Wei et al., "Autoregressive visual tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 9697–9706 (2023). https://doi.org/10.1109/CVPR52729.2023.00935
47. L. Lin et al., "SwinTrack: a simple and strong baseline for transformer tracking," in Adv. in Neural Inf. Process. Syst., 16743–16754 (2022).
48. J. Bromley et al., "Signature verification using a 'Siamese' time delay neural network," in Adv. in Neural Inf. Process. Syst. (1993).
49. Z. Teed and J. Deng, "RAFT: recurrent all-pairs field transforms for optical flow," Lect. Notes Comput. Sci. 12347, 402–419 (2020). https://doi.org/10.1007/978-3-030-58536-5_24
50. H. Jiang et al., "Super SloMo: high quality estimation of multiple intermediate frames for video interpolation," in Proc. IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR) (2018). https://doi.org/10.1109/CVPR.2018.00938
51. S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV) (2017). https://doi.org/10.1109/ICCV.2017.37
52. A. Vaswani et al., "Attention is all you need," in Adv. in Neural Inf. Process. Syst. (2017).
53. H. Fan et al., "LaSOT: a high-quality benchmark for large-scale single object tracking," in Proc. IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 5374–5383 (2019). https://doi.org/10.1109/CVPR.2019.00552
Biography

Saahil Islam is a second-year PhD student at Friedrich Alexander University in Erlangen, Germany. Having obtained an MSc degree from the same institution with a specialization in computer vision and segmentation for glacier calving front detection from remote sensing images, his current research focuses on medical imaging within the realm of image-guided therapy, conducted in collaboration with Siemens Healthineers. Specifically, he is dedicated to leveraging artificial intelligence to enhance real-time systems in image-guided therapy, aiming to contribute to advancements in this critical field.

Venkatesh N. Murthy, a seasoned computer scientist, brings over a decade of experience in computer vision and machine learning. Having earned his PhD from UMass Amherst, he has garnered acclaim through numerous publications in prestigious conferences and journals, amassing over 700 citations and securing multiple patents. Currently, he is a staff research scientist at Siemens Healthineers in Princeton, New Jersey, United States. He focuses on advancing object classification, detection, and tracking technologies, driving innovation in healthcare technology.