I. INTRODUCTION

In recent years, deep learning methods have been employed for many problems in medical image formation, including image-based and projection-based noise reduction, image reconstruction, scatter estimation, and artifact reduction. While deep neural network (DNN) based methods often outperform conventional algorithms both qualitatively and quantitatively, they lack interpretability because most DNNs are black boxes. Particularly for low dose CT imaging, recent generative methods such as generative adversarial networks (GANs) [1] and variational autoencoders (VAEs) [2] have demonstrated impressive performance, providing image quality competitive with commercial iterative reconstruction techniques [3]. In this work, instead of focusing on the actual denoising performance of DNN-based methods for CT imaging, we aim to lay the foundations for a post-hoc analysis of such networks in terms of their interpretability and robustness. To this end, we investigate what they have learned to represent and to ignore (i.e., their invariances) at different layers and argue that robust and non-robust denoising networks are invariant to different input features. Note that this type of analysis is not restricted to CT; similar methods can be applied to denoising networks for other imaging modalities (e.g., magnetic resonance imaging or positron emission tomography).

II. BACKGROUND

A. CT Image Denoising with DNNs

In this work we assume that high dose images $y \in \mathbb{R}^{m \times n}$ as well as low dose images $x \in \mathbb{R}^{m \times n}$ are available at training time. The aim of any deep-learning based denoising method is then to find a function $f(\cdot\,; \theta)$ with parameters $\theta$ such that […] (1), where $f$ is realized by a DNN. In recent years, most improvements toward finding an optimal $f$ have focused on alterations of the architecture and training scheme.
While earlier work utilized pixelwise losses (in image or feature space), which lead to smooth predictions that lack high-frequency information [4, 6], many recent methods are trained as GANs, leading to highly realistic denoising results [3, 5].

B. Invariances of DNNs

Our work is based on reference [7], where the authors seek to reconstruct and interpret the invariances of image classification DNNs using invertible neural networks (INNs). Given a network $f(x)$, we can analyze any of its internal latent representations $z$ by decomposing $f$ into $f(x) = \Psi(z) = \Psi(\Phi(x))$. To explain $z$, we need to know what information of the input $x$ is captured in $z$ and to what information $\Phi$ is invariant (and which is thus missing in $z$). To this end, the authors of [7] employ a VAE comprising an encoder $E$ and a decoder $D$ that is trained to learn a complete data representation $\bar{z} = E(x)$. Since the complete data representation $\bar{z}$ contains all information about the input $x$, including what is captured in $z$, the invariances $v$ can be related to it through a mapping $v = t(\bar{z} \mid z)$ conditioned on $z$. Here, it is assumed that invariances $v$ can be sampled from a Gaussian distribution, i.e., $p(v) = \mathcal{N}(v \mid 0, 1)$, and the mapping $t$ is realized through a normalizing flow [8–10], a sequence of INNs between the simple (normal) distribution $p(v)$ and the complex distribution $p(\bar{z} \mid z)$. Since $t$ is invertible, we can generate new $\bar{z} = t^{-1}(v \mid z)$ for a given $z$ by sampling $v \sim p(v)$. To visualize $\bar{z}$, we decode it with $D$, i.e., $\bar{x} = D(t^{-1}(v \mid z))$.

III. METHODS

A. Dataset

For all our studies, the Low Dose CT Image and Projection dataset [11] is employed. The dataset comprises 50 head scans, 50 chest scans, and 50 abdomen scans acquired at routine dose levels with a SOMATOM Definition Flash CT scanner (Siemens Healthineers, Forchheim, Germany). Additionally, the dataset provides simulated low dose reconstructions (at 25% dose for abdomen and head scans and at 10% dose for chest scans), which were used as input to the denoising networks. We split the dataset into 70%/20%/10% for training/validation/test across all patients and trained with a weighted sampling scheme such that slices from each patient were sampled with equal probability.
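A weighted sampling scheme of this kind can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes that "equal probability" means every patient receives the same total sampling mass regardless of its number of slices, and the function names and toy patient identifiers are hypothetical.

```python
import random
from collections import Counter


def patient_balanced_weights(patient_ids):
    """Return one weight per slice so that each patient has equal total mass.

    Assumption: a slice's weight is 1 / (number of patients * slices of its
    patient), so patients with many slices are not over-represented.
    """
    counts = Counter(patient_ids)
    n_patients = len(counts)
    return [1.0 / (n_patients * counts[p]) for p in patient_ids]


def sample_slices(patient_ids, k, seed=0):
    """Draw k slice indices (with replacement) using the balanced weights."""
    rng = random.Random(seed)
    weights = patient_balanced_weights(patient_ids)
    return rng.choices(range(len(patient_ids)), weights=weights, k=k)


# Hypothetical toy data: patient "A" has 6 slices, "B" has 2, "C" has 1.
patient_ids = ["A"] * 6 + ["B"] * 2 + ["C"]
weights = patient_balanced_weights(patient_ids)
drawn = sample_slices(patient_ids, k=30000)
per_patient = Counter(patient_ids[i] for i in drawn)
```

In the toy example each of the three patients ends up with roughly one third of the draws, although patient "A" contributes six times as many slices as patient "C".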
To make results between different methods comparable, we trained and validated all denoising networks as well as the invariance reconstruction method on the same training/validation split of our data.

B. Denoising Methods

While our method can be used to provide a post-hoc invariance analysis for any trained DNN-based denoising method, for simplicity we focus here on interpreting the invariances of two well-known denoising methods.

Chen et al. [4] proposed a simple three-layer convolutional neural network which was trained to minimize (1) using an L2 loss. The authors trained their network on patches of size 33 × 33 using an SGD optimizer and showed that their method can outperform conventional state-of-the-art methods.

Yang et al. [5] improved on previous work by training a Wasserstein GAN (WGAN) [12] in combination with a perceptual loss [13] in feature space. Furthermore, they utilize a deeper generator than [4] and train the network on larger patches of size 64 × 64.

We trained both [4] and [5] on the dataset described in Sec. III-A using the hyperparameters described in the original papers. Whenever hyperparameters were not stated by the authors, we ran a grid search and used the parameters that resulted in the lowest validation loss.

C. Recovering Invariances

Similar to reference [7], we first learn a complete data representation […]. For each of the two denoising networks evaluated, we train three conditional INNs (cINNs) to reconstruct the invariances at three different layers of the network. For Chen et al. [4] we do so at layers 1, 3, and 5, and for Yang et al. [5] at layers 1, 7, and 13 (see Tab. I). Each cINN $t$ is composed of four invertible blocks, where each block is composed of coupling blocks [16], actnorm layers [17], and shuffling layers.
For each invertible block, the conditioning on the denoising network representation $z$ is realized by concatenating an embedding $h = H(z)$, where $H$ is a shallow network, with the input to the respective block.

TABLE I: Overview of the generator architectures used in Chen et al. [4] and Yang et al. [5]. Kernel sizes of the 2D convolutions are indicated by k and their number of filters by f. Final nonlinearities of the original architectures were omitted to accommodate the normalization of our data.
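To illustrate how conditioning by concatenation keeps a block invertible, the following sketch implements a single conditional affine coupling layer in plain Python. It is an assumption-laden toy, not the architecture used here: the coupling follows the Real NVP form [10], the scale and shift subnetworks are replaced by fixed random linear maps, and the class name, dimensions, and the embedding values for $h$ are hypothetical.

```python
import math
import random


class ConditionalAffineCoupling:
    """Toy affine coupling layer whose scale and shift depend on [u1, h]."""

    def __init__(self, dim, cond_dim, seed=0):
        assert dim % 2 == 0, "this sketch splits the input into equal halves"
        rng = random.Random(seed)
        self.half = dim // 2
        in_dim = self.half + cond_dim
        # Fixed random weights stand in for a trained subnetwork (assumption).
        self.w_scale = [[rng.gauss(0, 0.5) for _ in range(in_dim)]
                        for _ in range(self.half)]
        self.w_shift = [[rng.gauss(0, 0.5) for _ in range(in_dim)]
                        for _ in range(self.half)]

    def _scale_shift(self, u1, h):
        inp = list(u1) + list(h)  # conditioning by concatenation
        s = [math.tanh(sum(w * a for w, a in zip(row, inp)))
             for row in self.w_scale]
        t = [sum(w * a for w, a in zip(row, inp)) for row in self.w_shift]
        return s, t

    def forward(self, u, h):
        """Map a data-side vector to the latent side; also return log|det J|."""
        u1, u2 = u[:self.half], u[self.half:]
        s, t = self._scale_shift(u1, h)
        v2 = [b * math.exp(si) + ti for b, si, ti in zip(u2, s, t)]
        return list(u1) + v2, sum(s)

    def inverse(self, v, h):
        """Exact inverse of forward for the same condition h."""
        v1, v2 = v[:self.half], v[self.half:]
        s, t = self._scale_shift(v1, h)  # v1 equals u1, so s and t match
        u2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(v2, s, t)]
        return list(v1) + u2


layer = ConditionalAffineCoupling(dim=4, cond_dim=3)
h = [0.2, -1.0, 0.5]                      # hypothetical embedding h = H(z)
rng = random.Random(1)
v = [rng.gauss(0, 1) for _ in range(4)]   # v sampled from N(0, 1)
z_bar = layer.inverse(v, h)               # sample of the complete representation
v_back, log_det = layer.forward(z_bar, h)
```

Because the first half of the vector passes through unchanged, the inverse can recompute the same scale and shift from it and the condition, so sampling $v$ and inverting recovers a representation that maps back to $v$ exactly (up to floating point error). A different embedding $h$ yields a different sample for the same $v$.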
For each network and layer we then reconstruct different samples of the invariances […]

IV. RESULTS

A. Denoising Methods

We find that the results from both denoising networks are similar to those reported in the respective original papers (Fig. 1). Due to the L2 loss in image space, the results from [4] appear smooth and lack structural fidelity. Training with an adversarial loss alleviates this, and consequently our results for [5] look much more realistic, with finer details and noise structures very similar to those present in the high dose images.

Fig. 1: Denoising performance of Chen et al. [4] and Yang et al. [5] for six different dataset samples (columns). Blue arrows indicate regions where the networks produced errors in the reconstruction of anatomical details.

However, we find that both methods fail to correctly reconstruct anatomical details in several cases (see Fig. 1, blue arrows). This is particularly problematic when the network is trained in an adversarial setting, where such false anatomies can look very convincing to the radiologist.

B. Reconstructed Invariances

The reconstructed invariances for both networks and two different samples (see Sec. III-C) are provided in Fig. 2. For each sample we also show the low dose input image $x$, the high dose ground truth image $y$, the reconstruction of the complete data representation […]

Fig. 2: Best viewed in color. Analysis of Chen et al. [4], (a) & (b), and Yang et al. [5], (i) & (ii). Provided are the low dose input image $x$, the high dose ground truth image $y$, the VAE network reconstruction […]

From this we find that both denoising methods are invariant to several anatomical features to some extent (Fig. 2, blue arrows). We also find a higher overall variance of the invariances in homogeneous regions of the image for [4], indicating that it is more invariant to the specific realization of noise in the low dose input image.
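One simple way to quantify such variance is a pixelwise mean and variance over several sampled reconstructions. The sketch below is an assumption about how this could be computed, not a description of the evaluation used here; the function name is hypothetical, and the small nested lists stand in for decoded invariance samples.

```python
from statistics import fmean, pvariance


def pixelwise_mean_and_variance(samples):
    """Compute per-pixel mean and population variance over a list of images.

    samples: list of images of identical shape, each a list of rows of floats.
    """
    rows, cols = len(samples[0]), len(samples[0][0])
    mean = [[fmean(s[r][c] for s in samples) for c in range(cols)]
            for r in range(rows)]
    var = [[pvariance([s[r][c] for s in samples]) for c in range(cols)]
           for r in range(rows)]
    return mean, var


# Hypothetical toy samples (three 2 x 2 "reconstructions"): the left column is
# identical in every sample, the right column changes from sample to sample.
samples = [
    [[1.0, 0.0], [5.0, 2.0]],
    [[1.0, 1.0], [5.0, 4.0]],
    [[1.0, 2.0], [5.0, 6.0]],
]
mean, var = pixelwise_mean_and_variance(samples)
```

Pixels whose value is fixed by the conditioning representation show zero variance across samples, whereas pixels the network is invariant to vary freely and therefore show a high variance.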
However, when inspecting the VAE reconstructions […]

V. CONCLUSION

In this work we analyzed deep neural networks for CT image denoising regarding their invariances to anatomical features in the low dose image domain. To reconstruct those invariances, we adapted a method from prior work on interpretable AI and sampled reconstructions of invariances for two CT denoising networks. Upon analysis of the reconstructed invariances, we find that the representations of both networks at different layers are invariant to several anatomical features. While this work demonstrated the potential of an invariance-based analysis of DNNs for CT image denoising, the ability to interpret those invariances is currently limited due to reconstruction errors from the embedding […]

ACKNOWLEDGMENT

This work was supported in part by the Helmholtz International Graduate School for Cancer Research, Heidelberg, Germany.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” NeurIPS, 2672–2680 (2014).
[2] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” ICLR (2014).
[3] H. Shan, A. Padole, F. Homayounieh, U. Kruger, R. D. Khera, C. Nitiwarangkul, M. K. Kalra, and G. Wang, “Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose CT image reconstruction,” Nature Machine Intelligence, 1(6), 269–276 (2019). https://doi.org/10.1038/s42256-019-0057-9
[4] H. Chen, Y. Zhang, W. Zhang, P. Liao, K. Li, J. Zhou, and G. Wang, “Low-dose CT denoising with convolutional neural network,” in International Symposium on Biomedical Imaging (ISBI), 143–146 (2017).
[5] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, “Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss,” IEEE TMI, 37(6), 1348–1357 (2018).
[6] H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, “Low-dose CT with a residual encoder-decoder convolutional neural network,” IEEE TMI, 36(12), 2524–2535 (2017).
[7] R. Rombach, P. Esser, and B. Ommer, “Making sense of CNNs: Interpreting deep representations & their invariances with INNs,” ECCV (2020).
[8] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” ICML, 1530–1538 (2015).
[9] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” ICLR (2015).
[10] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” ICLR (2017).
[11] C. McCollough, B. Chen, D. Holmes, X. Duan, Z. Yu, L. Yu, S. Leng, and J. Fletcher, “Data from low dose CT image and projection data [data set],” The Cancer Imaging Archive (2020).
[12] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” ICML, 214–223 (2017).
[13] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” ECCV, 694–711 (2016).
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, 770–778 (2016).
[15] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” ICLR (2019).
[16] L. Ardizzone, J. Kruse, S. Wirkert, D. Rahner, E. W. Pellegrini, R. S. Klessen, L. Maier-Hein, C. Rother, and U. Köthe, “Analyzing inverse problems with invertible neural networks,” ICLR (2019).
[17] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” NeurIPS, 31 (2018).