# Neural Network-Assisted Nonlinear Multiview Component Analysis: Identifiability and Algorithm

###### Abstract

Multiview analysis aims at extracting shared latent components from data samples that are acquired in different domains, e.g., image, text, and audio. Classic multiview analysis, e.g., canonical correlation analysis (CCA), tackles this problem via matching the linearly transformed views in a certain latent domain. More recently, powerful nonlinear learning tools such as kernel methods and neural networks have been utilized for enhancing the classic CCA. However, unlike linear CCA whose theoretical aspects are clearly understood, nonlinear CCA approaches are largely intuition-driven. In particular, it is unclear under what conditions the shared latent components across the views can be identified—while identifiability plays an essential role in many applications. In this work, we revisit nonlinear multiview analysis and address both the theoretical and computational aspects. We take a nonlinear multiview mixture learning viewpoint, which is a natural extension of the classic generative models for linear CCA. From there, we derive a nonlinear multiview analysis criterion. We show that minimizing this criterion leads to identification of the latent shared components up to certain ambiguities, under reasonable conditions. Our derivation and formulation also offer new insights into, and interpretations of, existing deep neural network-based CCA formulations. On the computation side, we propose an effective algorithm with simple and scalable update rules. A series of simulations and real-data experiments corroborate our theoretical analysis.

## I Introduction

Multiview analysis has been an indispensable tool in statistical signal processing, machine learning, and data analytics. In the context of multiview learning, a view can be understood as measurements of data entities (e.g., a cat) in a certain domain (e.g., text, image, and audio). Most data entities naturally appear in different domains. Multiview analysis aims at extracting essential and common information from different views. Compared with single-view analysis tools like principal component analysis (PCA), independent component analysis (ICA) [7], and nonnegative matrix factorization (NMF) [14], multiview analysis tools such as canonical correlation analysis (CCA) [18] have an array of unique features. For example, CCA has been shown to be more robust to noise and view-specific strong interference [4, 22].

The classic CCA has been extensively studied in the literature, ever since its proposal in statistics in the 1930s [19, 18]. In a nutshell, the classic CCA seeks linear transformations for the views. The transformations are supposed to ‘project’ the views to a domain where the views share similar representations. Interestingly, the formulated optimization problem, although being nonconvex, can be recast into a generalized eigendecomposition problem and solved efficiently [18, 40]. In recent years, many attempts have been made towards scaling up classic CCA to handle big datasets; see, e.g., [40]. Beyond the classic two-view CCA, a series of generalized CCA (GCCA) formulations for handling more views exist, whose scalable versions have also been considered [13, 23].

Linear transformation-based CCA/GCCA algorithms are elegant in computation. However, from a modeling viewpoint, restricting the transformations to be linear makes the ‘modeling power’ limited. For decades, much effort has been invested in extending the CCA/GCCA ideas to the nonlinear regime—via incorporating a variety of nonlinear transformations. For example, kernel CCA has been popular since the 2000s [17, 3]. More recently, together with the success of deep learning, deep neural networks have also been used to enhance CCA for unsupervised representation learning [39, 2]. Compared to kernel methods, deep neural networks are considered more flexible and more scalable.

Kernel CCA and deep CCA have demonstrated effectiveness in many real-world applications, e.g., image representation learning [39] and speech processing [38]. This is encouraging—it shows that incorporating nonlinearity in data analytics is indeed well-motivated. On the other hand, it is still largely unclear under what conditions these methods will work (or fail), or how to improve existing schemes with theoretical support. In fact, unlike linear CCA whose generative models, parameter identifiability issues, and computational aspects are fairly well understood, nonlinear CCA formulations are largely intuition-driven.

In this work, we revisit the nonlinear multiview analysis problem, and offer both theoretical understanding and theory-backed implementation. Our work is motivated by a recent work in linear CCA [22], where different views are modeled as mixtures of shared components and view-specific components (which are interference). This model is similar to the one considered in machine learning in the context of probabilistic CCA [4]. Under this model, an interesting interpretation of the effectiveness of CCA can be obtained. Specifically, the work in [22] shows that the classic linear CCA can extract the shared components’ range space—even if the interference terms are much stronger than the shared components. The model is fairly simple and succinct, yet the insight is significant: It explains the robustness of linear multiview analysis and gives clear scenarios under which CCA is preferred over single-view analysis tools such as PCA.

Building upon the intriguing perspectives in [22], we take a step further and consider the following problem: If the acquired views are mixtures of shared and view-specific components distorted by unknown nonlinear functions, is it still possible to extract the same shared information as in the linear case? This problem is well-motivated, since a large variety of acquired real data are subject to unknown nonlinear distortions, due to various reasons such as limited sensor dynamic range, modeling error, and non-additive noise.

However, taking unknown nonlinear distortions into consideration makes the problem of interest much more challenging—both in theory and practice. Nonlinearity removal from mixture models has been considered in some limited cases, which are mostly single-view analysis problems. Some notable ones are 1) nonlinear independent component analysis (nICA) [21, 20, 37, 1, 31], where the components of interest are statistically independent random processes, and 2) nonlinear mixture learning (NML) [41], where the sought latent components reside in a confined manifold, i.e., the probability simplex. In both cases, the assumptions on the components of interest are leveraged to come up with identification criteria. However, both statistical independence and the probability simplex-type structure are special assumptions, which may not hold in general. In addition, both nICA and NML assume ‘clean’ mixtures without interference present. With multiple views available, can we circumvent strong assumptions like independence or the probability simplex structure? More importantly, is fending off non-interesting strong interference components still possible under nonlinear settings?

**Contributions.** Bearing these questions in mind, we address both the analytical and computational aspects of nonlinear multiview analysis. Our detailed contributions are as follows:

**Model-Based Formulation and Analysis.** We propose a multiview nonlinear mixture model that is a natural extension of the mixture model-based linear multiview analysis [22, 4]. To be specific, we model each view as a nonlinear mixture of shared and view-specific interference components, where the nonlinear distortions are unknown continuous invertible functions. We propose an identification criterion for extracting the shared information across views. We show that solving the formulated problem implicitly removes the unknown nonlinearity up to trivial ambiguities—making the nonlinear multiview analysis problem boil down to a linear CCA problem. This means that our formulation enjoys the same identifiability properties as in the linear case [22], despite working under a much more challenging scenario.

**Neural Network-Based Algorithm Design.** Based on our formulation, we propose a neural network-based implementation. The formulated optimization surrogate is delicately designed to realize the identification criterion in practice. In particular, possible trivial solutions revealed in the analysis are circumvented via a careful construction of the optimization objective and constraints. Based on the formulation, we propose a simple block coordinate descent (BCD) algorithm. The proposed implementation is compatible with existing popular neural network architectures (e.g., convolutional neural networks (CNNs) and fully connected neural networks [26]) and is scalable for handling big data.

**Extensive Experiments.** We test our method in a number of simulations under different scenarios to validate the identifiability theory and to showcase the effectiveness of the implementation. In addition, a couple of real datasets (i.e., a multiview brain imaging dataset and a multiview handwritten digit dataset) are employed to demonstrate the usefulness of the proposed approach for handling real-world problems.

**Notation.** We largely follow the established conventions in signal processing. To be specific, we use $x$, $\boldsymbol{x}$, and $\boldsymbol{X}$ to represent a scalar, vector, and matrix, respectively. $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the Euclidean norm and the Frobenius norm, i.e., $\|\boldsymbol{x}\|_2=\sqrt{\sum_i x_i^2}$ and $\|\boldsymbol{X}\|_F=\sqrt{\sum_{i,j}X_{i,j}^2}$; $\|\cdot\|_0$ denotes the $\ell_0$ norm which counts the number of nonzero elements of its argument; $^{\top}$ and $^{\dagger}$ denote the transpose and Moore-Penrose pseudo-inverse operations, respectively; the shorthand notation $f$ is used to denote a function $f(\cdot)$; $f'$ and $f''$ denote the first-order and second-order derivatives of the function $f$, respectively; $\mathrm{Tr}(\cdot)$ denotes the trace operator; $\boldsymbol{1}$ denotes an all-one vector with a proper length; $\boldsymbol{I}$ denotes an identity matrix with a proper size; $\circ$ denotes the function composition operation.

## II Background

In this section, we briefly introduce some preliminaries pertaining to our work.

### II-A Mixture Models

Mixture models have proven very useful in signal processing and machine learning. The simplest linear mixture model (LMM) can be written as:

$$\boldsymbol{x}(\ell)=\boldsymbol{A}\boldsymbol{s}(\ell),\quad \ell=1,\ldots,N, \tag{1}$$

where $\boldsymbol{x}(\ell)\in\mathbb{R}^{M}$ denotes the $\ell$th observed signal, $\boldsymbol{A}\in\mathbb{R}^{M\times K}$ the mixing system, and $\boldsymbol{s}(\ell)\in\mathbb{R}^{K}$ the latent components (or, the ‘sources’) measured at sample $\ell$. Many applications are concerned with identifying $\boldsymbol{A}$ and/or the $\boldsymbol{s}(\ell)$’s from the observed $\boldsymbol{x}(\ell)$’s [12, 7, 30, 25]. If both the mixing system and the sources are unknown, estimating $\boldsymbol{A}$ and $\boldsymbol{s}(\ell)$ simultaneously poses a very hard problem—which is known as the blind source separation (BSS) problem [6]. The BSS problem is ill-posed, since in general the model is not identifiable; i.e., even if there is no noise, one could have an infinite number of solutions that satisfy (1). Nonetheless, identifiability can be established via exploiting properties of $\boldsymbol{s}(\ell)$. For example, the seminal work of ICA [7] shows that the identifiability of $\boldsymbol{s}(\ell)$ can be established (up to scaling and permutation ambiguities) by leveraging statistical independence between the elements of $\boldsymbol{s}(\ell)$. Later on, identifiability of $\boldsymbol{A}$ and/or $\boldsymbol{s}(\ell)$ was established through exploiting other properties of the latent components, e.g., nonnegativity [14], convex geometry [15], quasi-stationarity [30, 25], and boundedness [8].
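The ill-posedness discussed above can be illustrated numerically: any nonsingular transform can be absorbed into the mixing matrix and undone in the sources, leaving the observations unchanged. Below is a minimal numpy sketch (all variable names and sizes are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 3, 1000            # sensors, sources, samples (arbitrary choices)
A = rng.standard_normal((M, K)) # unknown mixing system
S = rng.standard_normal((K, N)) # unknown sources
X = A @ S                       # observed mixtures, cf. model (1)

# Any nonsingular T yields an alternative (A', S') that fits X equally well:
T = rng.standard_normal((K, K))
A_alt, S_alt = A @ T, np.linalg.inv(T) @ S
assert np.allclose(X, A_alt @ S_alt)   # same observations, different factors
```

This is exactly why side information on the sources (independence, nonnegativity, etc.) is needed to pin down the factors.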

Beyond the LMM, nonlinear mixture models (NMMs) have also attracted much attention—since nonlinearity naturally happens in practice for numerous reasons. For example, starting from the 1990s, a line of work named nonlinear ICA [21, 37, 1, 31] considered the following model:

$$\boldsymbol{x}(\ell)=\boldsymbol{g}\big(\boldsymbol{s}(\ell)\big), \tag{2}$$

where $\boldsymbol{g}:\mathbb{R}^{K}\rightarrow\mathbb{R}^{M}$ is a nonlinear mixing system. While it was shown that (2) is in general not identifiable, the so-called post-nonlinear (PNL) model can be identified, again under the premise that $s_i(\ell)$ and $s_j(\ell)$ (for $i\neq j$) are statistically independent random processes [37, 1, 44]. To be specific, consider the model

$$x_m(\ell)=g_m\big(\boldsymbol{a}_m^{\top}\boldsymbol{s}(\ell)\big),\quad m=1,\ldots,M, \tag{3}$$

with $\boldsymbol{A}=[\boldsymbol{a}_1,\ldots,\boldsymbol{a}_M]^{\top}$ and $g_m(\cdot)$ being a univariate nonlinear function. The PNL model, although less general relative to (2), is still very meaningful—it finds applications in cases where unknown nonlinear effects happen individually at multiple sensors/channels, e.g., in brain signal processing [43, 33] and hyperspectral imaging [10]. Under the PNL model, $\boldsymbol{s}(\ell)$ was first shown to be identifiable via exploiting statistical independence [37, 1]. Recently, the work in [41] proved that even if the sources are dependent, some other properties, e.g., nonnegativity and a sum-to-one structure (i.e., $\boldsymbol{1}^{\top}\boldsymbol{s}(\ell)=1$), can be exploited to identify the $g_m$’s under the PNL model.
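To make the PNL structure concrete, the following sketch generates synthetic data of the form (3): a linear mixture followed by a channel-wise invertible distortion. The particular distortions chosen here (tanh, cubic, etc.) are our illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N = 5, 2, 1000
A = rng.standard_normal((M, K))
S = rng.standard_normal((K, N))          # sources (possibly dependent)

# Channel-wise invertible distortions g_m applied after linear mixing,
# mimicking per-sensor nonlinear effects:
g = [np.tanh,
     lambda t: t + t**3 / 3,
     np.sinh,
     lambda t: 2.0 * t,
     np.arctan]
X = np.stack([g[m](A[m] @ S) for m in range(M)])   # X[m] = g_m(a_m^T S)
assert X.shape == (M, N)
```

Each row of `X` is distorted independently, which is exactly the structure that the nonlinearity-removal results later in the paper exploit.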

### II-B Multiview Data and Mixture Learning

In practice, data representing the same entities are oftentimes acquired in different domains—leading to the so-called multiview data. Multiview data has been frequently connected to mixture models, since it is believed that multiple views of the same data sample have certain shared but latent components. In addition, it is commonly believed that using multiple views may have advantages over using a single view for learning problems, e.g., for combating noise and strong interference—since more information is available.

One way of modeling multiview data is using the following linear model [4]:

$$\boldsymbol{x}^{(q)}(\ell)=\boldsymbol{A}^{(q)}\boldsymbol{s}(\ell),\quad q=1,\ldots,Q, \tag{4}$$

where $q$ is the index of the view and $Q$ is the number of available views. In this work, we mainly consider $Q=2$, but the techniques can be readily extended to cover the $Q>2$ case. From the above, one can see that $\boldsymbol{s}(\ell)$ can be used to model the latent shared components across different views, while $\boldsymbol{A}^{(q)}$ is the basis of the subspace where the $q$th view is observed. In other words, the different appearances of the views are caused by the differences of the observation subspaces, while the latent representations of the views are identical. In the recent work [22], the above model in (4) is further developed to incorporate view-specific components, where we have:

$$\boldsymbol{x}^{(q)}(\ell)=\boldsymbol{A}^{(q)}\boldsymbol{z}^{(q)}(\ell), \tag{5a}$$

$$\boldsymbol{z}^{(q)}(\ell)=\boldsymbol{\Pi}^{(q)}\big[\boldsymbol{s}(\ell)^{\top},\ \boldsymbol{c}^{(q)}(\ell)^{\top}\big]^{\top}, \tag{5b}$$

where $\boldsymbol{A}^{(q)}$ is the mixing matrix of view $q$, $\boldsymbol{\Pi}^{(q)}$ is a permutation matrix, and $\boldsymbol{z}^{(q)}(\ell)$ collects the shared components $\boldsymbol{s}(\ell)$ and the view-specific components $\boldsymbol{c}^{(q)}(\ell)$. This model is plausible, since it gives more flexibility for modeling the cross-view disparities. More importantly, it explains the enhanced robustness of multiview analysis over single-view tools, e.g., PCA, for extracting essential information about the data. More specifically, consider the following linear CCA formulation [18, 13]:

$$\max_{\boldsymbol{Q}^{(1)},\,\boldsymbol{Q}^{(2)}}\ \mathrm{Tr}\Big(\boldsymbol{Q}^{(1)\top}\boldsymbol{X}^{(1)}\boldsymbol{X}^{(2)\top}\boldsymbol{Q}^{(2)}\Big)\quad\text{s.t.}\quad \boldsymbol{Q}^{(q)\top}\boldsymbol{X}^{(q)}\boldsymbol{X}^{(q)\top}\boldsymbol{Q}^{(q)}=\boldsymbol{I},\ q=1,2, \tag{6}$$

where $\boldsymbol{X}^{(q)}=[\boldsymbol{x}^{(q)}(1),\ldots,\boldsymbol{x}^{(q)}(N)]$, $\boldsymbol{Q}^{(q)}$ is the linear transformation to be learned for view $q$, and the constraints are employed to avoid degenerate solutions. CCA aims to find the most correlated linear projections of both views. Note that CCA was not developed under the particular multiview mixtures in (5). However, it was shown in [22] that, under the model in Eq. (5), solving the above can identify $\boldsymbol{Q}^{(q)\top}\boldsymbol{X}^{(q)}=\boldsymbol{\Theta}\boldsymbol{S}$ with $\boldsymbol{S}=[\boldsymbol{s}(1),\ldots,\boldsymbol{s}(N)]$, where $\boldsymbol{\Theta}$ is a nonsingular matrix—no matter how strong the view-specific components are. On the contrary, PCA always extracts the energy-wise strongest components.

To understand the result in [22], it might be best to view the CCA cost function from its equivalent form as follows [13, 18]:

$$\min_{\boldsymbol{Q}^{(1)},\,\boldsymbol{Q}^{(2)}}\ \Big\|\boldsymbol{Q}^{(1)\top}\boldsymbol{X}^{(1)}-\boldsymbol{Q}^{(2)\top}\boldsymbol{X}^{(2)}\Big\|_F^2\quad\text{s.t.}\quad \boldsymbol{Q}^{(q)\top}\boldsymbol{X}^{(q)}\boldsymbol{X}^{(q)\top}\boldsymbol{Q}^{(q)}=\boldsymbol{I},\ q=1,2. \tag{7}$$

One can check that letting $\boldsymbol{Q}^{(1)}$ and $\boldsymbol{Q}^{(2)}$ be such that $\boldsymbol{Q}^{(q)\top}\boldsymbol{A}^{(q)}\boldsymbol{\Pi}^{(q)}=[\boldsymbol{\Theta},\ \boldsymbol{0}]$ for a common nonsingular $\boldsymbol{\Theta}$ makes the cost in (7) zero under the model in (5). The work in [22] further shows that this type of solution is unique up to an arbitrary nonsingular $\boldsymbol{\Theta}$. Since $\boldsymbol{Q}^{(q)\top}\boldsymbol{X}^{(q)}=\boldsymbol{\Theta}\boldsymbol{S}$, this means that the range space spanned by the shared components can be extracted. Note that the resulting projected view, i.e.,

$$\widehat{\boldsymbol{S}}=\boldsymbol{Q}^{(q)\top}\boldsymbol{X}^{(q)}=\boldsymbol{\Theta}\boldsymbol{S},$$

is again an LMM, and many techniques introduced in the previous section can be used to identify $\boldsymbol{S}$—with the interference components having been eliminated by CCA.
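The interference-rejection behavior discussed above can be checked numerically. Below is a minimal numpy sketch (our own textbook CCA via whitening plus SVD, not the paper's code) under model (5) with a single shared component, where the view-specific interference is ten times stronger than the shared component:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, D = 5000, 6, 1                      # samples, channels per view, shared dim
s  = rng.standard_normal((D, N))          # shared component
c1 = 10 * rng.standard_normal((M - D, N)) # strong view-specific interference
c2 = 10 * rng.standard_normal((M - D, N))
A1 = rng.standard_normal((M, M))
A2 = rng.standard_normal((M, M))
X1 = A1 @ np.vstack([s, c1])              # view 1, cf. model (5)
X2 = A2 @ np.vstack([s, c2])              # view 2

def cca_first_pair(X1, X2):
    """Top canonical direction pair via SVD of the whitened cross-covariance."""
    X1 = X1 - X1.mean(axis=1, keepdims=True)
    X2 = X2 - X2.mean(axis=1, keepdims=True)
    n = X1.shape[1]
    C11, C22, C12 = X1 @ X1.T / n, X2 @ X2.T / n, X1 @ X2.T / n
    W1 = np.linalg.inv(np.linalg.cholesky(C11))   # whitening transforms
    W2 = np.linalg.inv(np.linalg.cholesky(C22))
    U, sv, Vt = np.linalg.svd(W1 @ C12 @ W2.T)
    return W1.T @ U[:, :1], W2.T @ Vt[:1].T       # q1, q2

q1, q2 = cca_first_pair(X1, X2)
corr = abs(np.corrcoef(q1.T @ X1, s)[0, 1])
print(corr)   # close to 1: the projection recovers s despite the interference
```

A PCA projection of either view would instead follow the energy-dominant interference directions, which is the contrast the text draws.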

### II-C Nonlinear Multiview Learning

As in the single-view case, it is natural to consider nonlinearity in the multiview setting. For example, nonlinear learning tools such as the kernel method and deep learning were combined with CCA, where interesting results were observed [3, 27, 39, 2]. For example, the idea of deep CCA is to employ deep neural networks, instead of the linear operators $\boldsymbol{Q}^{(q)}$ as in the classic CCA, to perform data transformation. The deep CCA formulation in [2] is based on maximizing the correlation between the $\boldsymbol{f}^{(q)}(\boldsymbol{x}^{(q)})$’s across views, where $\boldsymbol{f}^{(q)}$ is a deep neural network-represented nonlinear transformation. Nevertheless, these works mostly focus on practical implementations, rather than theoretical aspects. It is unclear if the interesting identifiability results as in the classic multiview mixture model in [22] still hold when unknown nonlinearity is imposed onto the views. This is the starting point of our work.

## III Proposed Approach

### III-A A Nonlinear Multiview Model

In this section, we propose a nonlinear multiview analysis method that aims at learning shared information from the following model:

$$\boldsymbol{x}^{(q)}(\ell)=\boldsymbol{g}^{(q)}\big(\boldsymbol{A}^{(q)}\boldsymbol{z}^{(q)}(\ell)\big), \tag{8a}$$

$$\boldsymbol{z}^{(q)}(\ell)=\big[\boldsymbol{s}(\ell)^{\top},\ \boldsymbol{c}^{(q)}(\ell)^{\top}\big]^{\top}, \tag{8b}$$

where the shared components are uncorrelated zero-mean random processes, i.e.,

$$\mathbb{E}[\boldsymbol{s}(\ell)]=\boldsymbol{0},\quad \mathbb{E}\big[\boldsymbol{s}(\ell)\boldsymbol{s}(\ell)^{\top}\big]=\boldsymbol{D}\ \text{(diagonal)},$$

the function $\boldsymbol{g}^{(q)}=[g_1^{(q)},\ldots,g_{M_q}^{(q)}]^{\top}$, and $g_m^{(q)}$ represents the view-specific nonlinear distortion at channel $m$. Note that each $g_m^{(q)}$ is assumed to be invertible—otherwise there is no hope to recover the underlying model. Under the nonlinear setting, we also assume that the latent variables are defined over continuous open sets $\mathcal{S}$ and $\mathcal{C}^{(q)}$, i.e.,

$$\boldsymbol{s}(\ell)\in\mathcal{S},\quad \boldsymbol{c}^{(q)}(\ell)\in\mathcal{C}^{(q)}. \tag{9}$$

Note that this model is a natural extension of the multiview linear mixture models as in [4, 22] (especially the one in [22]) and the PNL model in single-view nonlinear mixture analysis [37, 1, 44].

As we have discussed, one major motivation behind the model in (8) is that in signal sensing and data acquisition, nonlinear distortions oftentimes happen at the sensor end. Hence, a PNL model is considered appropriate for such cases. Nonlinear effects are perhaps particularly severe for biosensors, since biological signals are hard to predict or calibrate. On the other hand, multiple views often exist in biology, e.g., electroencephalogram (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI) measurements for the same stimuli [35, 34, 16]. The proposed model in (8) can be leveraged to handle such biosensor-acquired multiview data.

We should mention that the model in (8) could also benefit more general data analytics applications such as image/text classification. The model in (8) evolved from the classic linear mixture model that has been widely used in data science for pre-processing, in particular, dimensionality reduction via computational tools such as PCA, ICA, and CCA. Adding a layer of nonlinearity can help capture dynamics that are not modeled by the classic tools, which likely leads to better performance. This is the basic intuition behind the deep autoencoder [26] and kernel/deep CCA [27, 2]. Nonetheless, unlike the deep autoencoder or deep CCA that are purely data-driven, we take a model-based route in this work to reveal the insights behind the effectiveness of nonlinear multiview learning—which may also shed light on more principled designs of learning criteria and algorithms.

### III-B A Function Learning-Based Formulation

Our goal is to learn the shared components $\boldsymbol{s}(\ell)$ (up to reasonable ambiguities) under the model in (8). The learning objective is identical to that in [22], but the model now involves unknown nonlinear distortions, thereby being much more challenging. Note that we do not impose strong structural assumptions on the latent components, e.g., statistical independence as in [37, 1] or the stochasticity assumption (i.e., $\boldsymbol{1}^{\top}\boldsymbol{s}(\ell)=1$, $\boldsymbol{s}(\ell)\geq\boldsymbol{0}$) as in [41]—hence, existing nonlinear ICA and mixture learning techniques cannot be applied to the problem of interest.

Our idea is to seek a nonlinear mapping $\boldsymbol{f}^{(q)}$ and a linear operator $\boldsymbol{Q}^{(q)}$ for each view such that the following criterion is minimized:

$$\min_{\{\boldsymbol{f}^{(q)},\,\boldsymbol{Q}^{(q)}\}}\ \frac{1}{N}\sum_{\ell=1}^{N}\Big\|\boldsymbol{Q}^{(1)\top}\boldsymbol{f}^{(1)}\big(\boldsymbol{x}^{(1)}(\ell)\big)-\boldsymbol{Q}^{(2)\top}\boldsymbol{f}^{(2)}\big(\boldsymbol{x}^{(2)}(\ell)\big)\Big\|_2^2 \tag{10a}$$

$$\text{s.t.}\quad \frac{1}{N}\sum_{\ell=1}^{N}\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{x}^{(q)}(\ell)\big)\boldsymbol{f}^{(q)}\big(\boldsymbol{x}^{(q)}(\ell)\big)^{\top}\boldsymbol{Q}^{(q)}=\boldsymbol{I},\quad f_m^{(q)}\ \text{invertible},\ \forall m,q. \tag{10b}$$

The idea is very similar to that of the linear CCA reformulation in (7), with nonlinearity taken into consideration. Ideally, we wish to obtain

$$\widehat{\boldsymbol{f}}^{(q)}=\big(\boldsymbol{g}^{(q)}\big)^{-1}\ \text{(up to affine transformations)}, \tag{11a}$$

$$\widehat{\boldsymbol{Q}}^{(q)\top}\boldsymbol{A}^{(q)}=[\boldsymbol{\Theta},\ \boldsymbol{0}], \tag{11b}$$

$$\widehat{\boldsymbol{Q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\big(\boldsymbol{x}^{(q)}(\ell)\big)=\boldsymbol{\Theta}\boldsymbol{s}(\ell), \tag{11c}$$

where $\boldsymbol{\Theta}$ is nonsingular. The above will extract the shared row subspace. Also note that the solution in (11) is a legitimate solution to Problem (10). The key question is: Is the solution in (11) the only type of solution? That is, does the formulation in (10) have identifiability of the shared subspace spanned by the rows of $\boldsymbol{S}$?

### III-C Nonlinearity Removal

To see how we approach the identifiability problem, we re-write Problem (10) in its population form:

$$\text{find}\quad \big\{\boldsymbol{f}^{(q)},\ \boldsymbol{Q}^{(q)}\big\}_{q=1,2} \tag{12a}$$

$$\text{s.t.}\quad \boldsymbol{Q}^{(1)\top}\boldsymbol{f}^{(1)}\big(\boldsymbol{x}^{(1)}\big)=\boldsymbol{Q}^{(2)\top}\boldsymbol{f}^{(2)}\big(\boldsymbol{x}^{(2)}\big), \tag{12b}$$

$$\mathbb{E}\Big[\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{x}^{(q)}\big)\boldsymbol{f}^{(q)}\big(\boldsymbol{x}^{(q)}\big)^{\top}\boldsymbol{Q}^{(q)}\Big]=\boldsymbol{I},\quad q=1,2, \tag{12c}$$

$$f_m^{(q)}\ \text{invertible},\quad \forall m,q. \tag{12d}$$

Note that the above is derived from (10) assuming that one has uncountably many samples such that all possible values of $\boldsymbol{x}^{(q)}$ are exhausted. We use an equality constraint in (12b) since, when there is no noise, the optimal value in (10) should be zero under the model in Eq. (8)—attaining zero fitting cost is equivalent to satisfying an equality constraint.

To see the reason why the formulation in (12) can remove nonlinearity and identify the desired signal subspace, let us start with a simple illustrative case where each view has two channels and there is only one shared component. This way, the generative model is simplified as follows:

$$\boldsymbol{x}^{(q)}(\ell)=\boldsymbol{g}^{(q)}\Big(\boldsymbol{A}^{(q)}\big[s(\ell),\ c^{(q)}(\ell)\big]^{\top}\Big),\quad \boldsymbol{A}^{(q)}\in\mathbb{R}^{2\times 2}.$$

For this simplified case, the solution $\boldsymbol{Q}^{(q)}$ becomes a row vector $\boldsymbol{q}^{(q)\top}\in\mathbb{R}^{1\times 2}$. One can show that the following holds:

###### Theorem 1 (The Simplified Case)

Assume that the elements of $\boldsymbol{A}^{(q)}$ are drawn from any jointly continuous distribution for $q=1,2$, and that $g_m^{(q)}$ and the composition $\widehat{f}_m^{(q)}\circ g_m^{(q)}$ are both twice differentiable. Suppose that $\big(\widehat{\boldsymbol{f}}^{(q)},\ \widehat{\boldsymbol{q}}^{(q)}\big)$ for $q=1,2$ are feasible solutions of Problem (12). Then, the following holds almost surely:

$$\widehat{\boldsymbol{q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\big(\boldsymbol{g}^{(q)}(\boldsymbol{y})\big)=\boldsymbol{c}^{(q)\top}\boldsymbol{y}+b^{(q)},$$

where $\boldsymbol{c}^{(q)}$ is a constant vector and $b^{(q)}$ a constant scalar for $q=1,2$; i.e., the composition $\widehat{\boldsymbol{q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\circ\boldsymbol{g}^{(q)}$ is an affine function with probability one.

###### Proof:

The proof is relegated to Appendix A in the supplementary materials. ∎

Theorem 1 shows that even if we have different view-specific components for multiview data, we can always identify the shared components under mild assumptions. This property is quite appealing since no strong assumptions such as statistical independence, nonnegativity, or a simplex structure are required for the latent components—as opposed to existing frameworks [41, 37, 1, 21, 20]. In addition, there is no constraint on the energy of each component—i.e., even if the energy of the shared component is significantly smaller than that of the view-specific component, the proposed criterion can still recover the shared subspace. This property is the same as that of the classic linear CCA [22], which we restate in the appendix.

In the proof of Theorem 1, we have implicitly used the fact that there exists at least one solution to Problem (10) such that $\widehat{\boldsymbol{q}}^{(q)}$ has no zero entries (i.e., $\widehat{\boldsymbol{q}}^{(q)}$ is a “dense” vector). In fact, this holds for the more general case where $\widehat{\boldsymbol{Q}}^{(q)}$ is a matrix. We have the following proposition:

###### Proposition 1

Assume that $\boldsymbol{A}^{(q)}$ is drawn from any joint absolutely continuous distribution for $q=1,2$. Then, there always exists a solution of Problem (12) that satisfies

$$\widehat{Q}^{(q)}_{m,k}\neq 0,\quad \forall m,k,$$

with probability one.

###### Proof:

The proof is relegated to Appendix C in the supplementary materials. ∎

With the insights gained from the simplified case and Proposition 1, we are ready to show the subspace identifiability of the general case. We have the following theorem:

###### Theorem 2 (Nonlinearity Removal)

Consider the nonlinear mixing model in (8). Assume that $g_m^{(q)}$ and the compositions $\widehat{f}_m^{(q)}\circ g_m^{(q)}$ are twice differentiable, and that the mixing matrices $\boldsymbol{A}^{(q)}$ for $q=1,2$ are drawn from any absolutely continuous distributions. Suppose that $\big(\widehat{\boldsymbol{f}}^{(q)},\ \widehat{\boldsymbol{Q}}^{(q)}\big)$ for $q=1,2$ are solutions of (12) with $\widehat{\boldsymbol{Q}}^{(q)}$ having no zero entries. Then, the following holds almost surely:

$$\widehat{\boldsymbol{Q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\big(\boldsymbol{g}^{(q)}(\boldsymbol{y})\big)=\boldsymbol{C}^{(q)}\boldsymbol{y}+\boldsymbol{b}^{(q)},$$

where $\boldsymbol{C}^{(q)}$ is a constant matrix and $\boldsymbol{b}^{(q)}$ a constant vector for $q=1,2$; i.e., the composition $\widehat{\boldsymbol{Q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\circ\boldsymbol{g}^{(q)}$ is an affine function with probability one.

###### Proof:

The proof is relegated to Appendix B in the supplementary materials. ∎

Note that for the general case, we have an additional constraint that $\widehat{\boldsymbol{Q}}^{(q)}$ has no zero entries, which arises because the general case is more challenging in terms of analysis. To summarize, we have the following corollary:

###### Corollary 1 (Subspace Identifiability)

Under the generative model (8), assume that a feasible solution of Problem (12), denoted by $\widehat{\boldsymbol{f}}^{(q)}$ and $\widehat{\boldsymbol{Q}}^{(q)}$ for $q=1,2$, can be found, where $\widehat{\boldsymbol{Q}}^{(q)}$ has no zero entries. Assume that $\boldsymbol{A}^{(1)}$ and $\boldsymbol{A}^{(2)}$ are drawn from a certain jointly continuous distribution. Also assume that the learned functions satisfy that the components of $\widehat{\boldsymbol{f}}^{(q)}(\boldsymbol{x}^{(q)})$ admit zero mean. Then, we have

$$\widehat{\boldsymbol{Q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}\big(\boldsymbol{x}^{(q)}(\ell)\big)=\boldsymbol{\Theta}\boldsymbol{s}(\ell)$$

for a certain nonsingular $\boldsymbol{\Theta}$ almost surely.

###### Proof:

We would like to remark that the assumption that $\widehat{\boldsymbol{Q}}^{(q)\top}\widehat{\boldsymbol{f}}^{(q)}(\boldsymbol{x}^{(q)})$ has zero mean is natural, since $\boldsymbol{s}(\ell)$ in our model has zero mean. In addition, from a CCA viewpoint, correlation is measured with centered data. However, neither of the formulations in (12) and (10) can enforce this directly—from what we have proved in Theorems 1-2, the learned mapping is affine, not linear. That is, a constant term exists. Nonetheless, this term can be easily removed via adding a constraint in the formulation—more discussion follows in the next section.

### III-D Related Works

We should mention that there are a couple of works that are related to mixture models and nonlinear CCA. A classic work in [3] utilizes kernel CCA to solve the ICA problem under the linear mixture model. The method splits the observation $\boldsymbol{x}(\ell)$ into two parts, $\boldsymbol{x}_1(\ell)$ and $\boldsymbol{x}_2(\ell)$, and then applies kernel CCA to match the transformed $\boldsymbol{x}_1(\ell)$ and $\boldsymbol{x}_2(\ell)$—through which identification can be achieved. The method uses nonlinear CCA to solve the problem of interest, while the model itself is not nonlinearly distorted. Another work in [44] considers the PNL model in (3). Algorithms were derived for removing nonlinearity, again, through splitting the received signals into two parts, i.e., $\boldsymbol{x}_1(\ell)$ and $\boldsymbol{x}_2(\ell)$, and maximizing the correlation between the transformed $\boldsymbol{x}_1(\ell)$ and $\boldsymbol{x}_2(\ell)$. This is effectively multiview matching as in our approach. Nevertheless, that work does not consider the view-specific interference terms $\boldsymbol{c}^{(q)}(\ell)$—which is an important consideration in practice. The work also imposes strong assumptions, e.g., Gaussianity of the sources, for establishing identifiability.

## IV Practical Implementation

The identifiability theorems indicate that the nonlinear functions in the data generation process may be removed up to affine transformations—if one can find a solution to Problem (12). Problem (12) addresses the population case, and thus requires practical approximations under finite samples and a workable parametrization for $\boldsymbol{f}^{(q)}$ and $\boldsymbol{Q}^{(q)}$. In this section, we cast (12) into a numerical optimization-friendly form and propose an algorithm for tackling the reformulated problem.

### IV-A Parametrization for Nonlinear Functions

To tackle Problem (12), we first parametrize $\boldsymbol{f}^{(1)}$ and $\boldsymbol{f}^{(2)}$ using neural networks. Since we seek continuous functions, neural networks are good candidates, since they are the so-called ‘universal function approximators’. Note that one-hidden-layer networks can already represent all continuous functions over bounded domains to arbitrary accuracy with a finite number of neurons [9]. Nevertheless, we keep our parametrization flexible to incorporate multiple layers—which have proven effective in practice [26]. In addition, many established network configurations and architectures can be incorporated into our framework. This is particularly of interest since multiview data oftentimes come in quite diverse forms, e.g., image and text. Using some established neural network paradigms for different types of data (e.g., CNNs for image data [26]) may enhance performance.

In our case, since we consider each $f_m^{(q)}$ as a nonlinear function that is independent of the others, we parametrize it as a multilayer scalar-to-scalar network:

$$f_m^{(q)}(x)=\boldsymbol{w}_{\text{out}}^{\top}\,\boldsymbol{\sigma}\Big(\boldsymbol{W}_{L-1}\cdots\boldsymbol{\sigma}\big(\boldsymbol{w}_{\text{in}}\,x+\boldsymbol{b}_1\big)\cdots+\boldsymbol{b}_{L-1}\Big)+b_L,$$

where $\boldsymbol{w}_{\text{in}}$ and $\boldsymbol{w}_{\text{out}}$ are the network weights of the input and output layers, respectively, $\boldsymbol{W}_{i}$ denotes the network weights from layer $i$ to layer $i+1$, $\boldsymbol{b}_i$ is the bias term of layer $i$, and $\boldsymbol{\sigma}(\cdot)$ is the (element-wise) activation function. One typical activation function is the sigmoid function, i.e., $\sigma(t)=1/(1+e^{-t})$. There exist many other choices, e.g., the $\tanh$ function and the so-called rectified linear unit (ReLU) function [26]; see Fig. 2 for some examples. Note that different configurations of $\boldsymbol{W}_i$ and $\boldsymbol{\sigma}$ lead to different types of neural networks; see [26].
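As a concrete illustration, a one-hidden-layer instance of such a scalar-to-scalar network can be sketched in numpy as follows (weights, sizes, and names are ours, chosen for illustration):

```python
import numpy as np

def make_channel_mlp(hidden=32, seed=3):
    """One-hidden-layer scalar network f(x) = w_out^T tanh(w_in * x + b1) + b2,
    a minimal instance of the per-channel parametrization described above."""
    rng = np.random.default_rng(seed)
    w_in  = 0.5 * rng.standard_normal(hidden)
    b1    = 0.5 * rng.standard_normal(hidden)
    w_out = 0.5 * rng.standard_normal(hidden)
    b2    = 0.0
    def f(x):                                   # x: (N,) samples of one channel
        h = np.tanh(np.outer(x, w_in) + b1)     # hidden layer, shape (N, hidden)
        return h @ w_out + b2                   # scalar output per sample
    return f

f1 = make_channel_mlp()
x = np.linspace(-2, 2, 100)
y = f1(x)
assert y.shape == (100,)
```

In the proposed framework, one such network would be instantiated per channel and per view, with the weights trained jointly as described in the next subsection.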

### IV-B Reformulation

With the neural network-based parametrization, one can use (10) as a working surrogate, with proper modifications. To be specific, we consider the following optimization problem:

$$\begin{aligned}\min_{\boldsymbol{Z},\{\boldsymbol{\theta}^{(q)},\,\boldsymbol{\varphi}^{(q)},\,\boldsymbol{Q}^{(q)}\}}\ &\sum_{q=1}^{2}\Big\|\boldsymbol{Z}-\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)\Big\|_F^2+\lambda\sum_{q=1}^{2}\Big\|\boldsymbol{X}^{(q)}-\boldsymbol{r}^{(q)}\big(\boldsymbol{f}^{(q)}(\boldsymbol{X}^{(q)})\big)\Big\|_F^2\\ \text{s.t.}\quad &\frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I},\quad \boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0},\end{aligned} \tag{13}$$

where we introduce a slack variable $\boldsymbol{Z}$ that represents the extracted shared components (ideally we wish $\boldsymbol{Z}=\boldsymbol{\Theta}\boldsymbol{S}$), $\boldsymbol{x}^{(q)}(\ell)$ (the $\ell$th column of $\boldsymbol{X}^{(q)}$) is the $\ell$th input data for the $q$th view, $\boldsymbol{f}^{(q)}$ is a neural network (NN)-parametrized element-wise nonlinear mapping that we aim at learning for nonlinearity removal, and $\boldsymbol{r}^{(q)}$ is another NN-parametrized nonlinear function for learning the generative function $\boldsymbol{g}^{(q)}$ in (8). For conciseness, we use $\boldsymbol{\theta}^{(q)}$ and $\boldsymbol{\varphi}^{(q)}$ to denote the network parameters in the NNs representing $\boldsymbol{f}^{(q)}$ and $\boldsymbol{r}^{(q)}$ for $q=1,2$, respectively.

To explain the formulation, the first term in the cost function comes from lifting the constraint (12b) into the cost via the slack variable $\boldsymbol{Z}$. This is equivalent to the fitting formulation in (10). Note that introducing $\boldsymbol{Z}$ is important, since it enables us to derive a lightweight algorithm. The second term is to (approximately) enforce $\boldsymbol{f}^{(q)}$ to be invertible (i.e., to reflect the invertibility constraints in (10) and (12)), at least for the available data samples $\boldsymbol{x}^{(q)}(\ell)$ for $\ell=1,\ldots,N$. Note that if $\boldsymbol{f}^{(q)}$ is invertible, then there exists $\boldsymbol{r}^{(q)}$ such that the second term is zero—but the converse is not necessarily true. Nonetheless, using such a data reconstruction term to prevent $\boldsymbol{f}^{(q)}$ from being a non-invertible function generally works, especially when the number of samples is large. This idea is known as the autoencoder [26], which is considered a nonlinear counterpart of PCA. The constraint $\frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I}$ corresponds to (10b). An illustration of the proposed neural network-based implementation is shown in Fig. 1.

We should highlight the constraint

$$\boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0},$$

i.e., each row (component) of the extracted $\boldsymbol{Z}$ has zero mean, which is enforced due to the reason that we discussed after Corollary 1. In addition, adding this constraint is in fact quite vital for avoiding numerical problems. To see this subtle point, recall that according to Theorems 1-2, we have

$$\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{x}^{(q)}(\ell)\big)=\boldsymbol{C}^{(q)}\boldsymbol{s}(\ell)+\boldsymbol{b}^{(q)}, \tag{14}$$

where $\boldsymbol{b}^{(q)}$ is a constant vector. If we directly match $\boldsymbol{Q}^{(1)\top}\boldsymbol{f}^{(1)}(\boldsymbol{x}^{(1)}(\ell))$ with $\boldsymbol{Q}^{(2)\top}\boldsymbol{f}^{(2)}(\boldsymbol{x}^{(2)}(\ell))$, i.e., enforcing

$$\boldsymbol{Q}^{(1)\top}\boldsymbol{f}^{(1)}\big(\boldsymbol{x}^{(1)}(\ell)\big)=\boldsymbol{Q}^{(2)\top}\boldsymbol{f}^{(2)}\big(\boldsymbol{x}^{(2)}(\ell)\big),$$

then one trivial solution is to simply make $\boldsymbol{C}^{(q)}=\boldsymbol{0}$ and $\boldsymbol{b}^{(1)}=\boldsymbol{b}^{(2)}$—i.e., the constant term can easily dominate. Hence, we enforce $\boldsymbol{Z}$ to be zero mean, which will automatically take out $\boldsymbol{b}^{(q)}$.
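To make the working formulation concrete, the following numpy sketch evaluates the cost of (13) at a feasible point. The construction of the slack variable also illustrates how the zero-mean and whitening constraints interact; all names (including the stand-in encoder/decoder) are ours, not the paper's:

```python
import numpy as np

def objective(Z, F, U, X, R, lam=0.1):
    """Cost of the working formulation: latent-matching term plus the
    autoencoder (invertibility-promoting) term, summed over the two views.
    F[q] are encoder outputs f(X[q]); R[q] are decoder reconstructions r(F[q])."""
    match = sum(np.linalg.norm(Z - U[q].T @ F[q], 'fro')**2 for q in range(2))
    recon = sum(np.linalg.norm(X[q] - R[q], 'fro')**2 for q in range(2))
    return match + lam * recon

rng = np.random.default_rng(4)
K, M, N = 2, 5, 100
# A feasible slack Z: zero row means and (1/N) Z Z^T = I.
G = rng.standard_normal((K, N))
G -= G.mean(axis=1, keepdims=True)            # zero-mean rows
Z = np.sqrt(N) * np.linalg.qr(G.T)[0].T       # orthonormalize, then rescale

X = [rng.standard_normal((M, N)) for _ in range(2)]
F = [np.tanh(x) for x in X]                                 # stand-in encoders
U = [rng.standard_normal((M, K)) for _ in range(2)]
R = [np.arctanh(np.clip(f, -0.999, 0.999)) for f in F]      # stand-in decoders
val = objective(Z, F, U, X, R)
assert np.allclose(Z @ np.ones(N), 0, atol=1e-6)            # Z 1 = 0
assert np.allclose(Z @ Z.T / N, np.eye(K), atol=1e-6)       # (1/N) Z Z^T = I
```

Note that the QR-based construction preserves the zero-mean property because the orthonormal basis spans the row space of the centered matrix, which is orthogonal to the all-one vector.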

###### Remark 1

We should mention that we do not constrain $\boldsymbol{Q}^{(q)}$ in our working formulation, since a $\boldsymbol{Q}^{(q)}$ with zero entries rarely happens if $\boldsymbol{Q}^{(q)}$ is randomly initialized (it never happened in our extensive experiments). Hence, incorporating such a hard constraint just for ‘safety’ in theory may not be worthwhile—much more complex algorithms may be required for handling this constraint.

### IV-C Proposed Algorithm

We propose a block coordinate descent (BCD)-based algorithm to handle Problem (13).

#### IV-C1 The $\boldsymbol{\theta}$-Subproblem

We first consider the problem of updating the neural networks when fixing $\boldsymbol{Z}$. This is an unconstrained optimization problem and can be handled by gradient descent. Denote the loss function as $L(\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ collects $\boldsymbol{\theta}^{(q)}$, $\boldsymbol{\varphi}^{(q)}$, and $\boldsymbol{Q}^{(q)}$ for $q=1,2$, and $L(\boldsymbol{\theta})$ is the cost function of (13) with $\boldsymbol{Z}$ fixed.

At iteration $t$, the update rule is simply

$$\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\alpha_t\,\nabla_{\boldsymbol{\theta}}L\big(\boldsymbol{\theta}^{(t)}\big), \tag{15}$$

where $\alpha_t$ is the step size chosen for the $t$th update.

Note that computing $\nabla_{\boldsymbol{\theta}}L$ is normally not easy, since computing the gradient of neural networks is a resource-consuming process—the gradient normally requires backpropagation (BP)-based algorithms to compute in a sample-by-sample manner if multiple layers are involved. Hence, instead of using the full gradient to update $\boldsymbol{\theta}$, one can also use a stochastic gradient, i.e.,

$$\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\alpha_t\,\frac{1}{|\mathcal{B}_t|}\sum_{\ell\in\mathcal{B}_t}\nabla_{\boldsymbol{\theta}}L_{\ell}\big(\boldsymbol{\theta}^{(t)}\big), \tag{16}$$

where $\mathcal{B}_t$ is an index set randomly sampled at iteration $t$ and $L_{\ell}$ denotes the loss evaluated at sample $\ell$.

#### IV-C2 The $\boldsymbol{Z}$-Subproblem

To update $\boldsymbol{Z}$, we solve the following subproblem:

$$\min_{\boldsymbol{Z}}\ \sum_{q=1}^{2}\Big\|\boldsymbol{Z}-\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)\Big\|_F^2\quad \text{s.t.}\quad \frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I},\quad \boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0}. \tag{17}$$

This problem is seemingly difficult since it has two constraints, one of which is nonconvex. However, this problem turns out to have a semi-algebraic solution. To see this, we show the following lemma:

###### Lemma 1

Consider the following optimization problem

$$\max_{\boldsymbol{Z}}\ \mathrm{Tr}\big(\boldsymbol{Z}\boldsymbol{M}^{\top}\big) \tag{18a}$$

$$\text{s.t.}\quad \frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I},\quad \boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0}. \tag{18b}$$

An optimal solution is

$$\boldsymbol{Z}^{\star}=\sqrt{N}\,\boldsymbol{U}_{M}\boldsymbol{V}_{M}^{\top},$$

where $\boldsymbol{U}_{M}$ and $\boldsymbol{V}_{M}$ are the left and right singular vectors of $\boldsymbol{M}_c$, respectively, with $\boldsymbol{M}_c=\boldsymbol{M}\big(\boldsymbol{I}-\tfrac{1}{N}\boldsymbol{1}\boldsymbol{1}^{\top}\big)$.

###### Proof:

The proof is relegated to Appendix D in the supplementary materials. ∎

One can see that the $\boldsymbol{Z}$-subproblem can be re-expressed as

$$\max_{\boldsymbol{Z}}\ \mathrm{Tr}\big(\boldsymbol{Z}\boldsymbol{M}^{\top}\big) \tag{19}$$

$$\text{subject to}\quad \frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I},\quad \boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0},$$

where $\boldsymbol{M}=\sum_{q=1}^{2}\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)$. The above can be shown by expanding (17):

$$\sum_{q=1}^{2}\Big\|\boldsymbol{Z}-\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)\Big\|_F^2=\sum_{q=1}^{2}\Big(\|\boldsymbol{Z}\|_F^2-2\,\mathrm{Tr}\big(\boldsymbol{Z}\,\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)^{\top}\boldsymbol{Q}^{(q)}\big)+\big\|\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)\big\|_F^2\Big).$$

Since the first and the last terms are constants (note that $\|\boldsymbol{Z}\|_F^2=NK$ under the constraint), we have the following equivalent problem:

$$\max_{\boldsymbol{Z}}\ \mathrm{Tr}\Big(\boldsymbol{Z}\sum_{q=1}^{2}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)^{\top}\boldsymbol{Q}^{(q)}\Big)\quad\text{s.t.}\quad \frac{1}{N}\boldsymbol{Z}\boldsymbol{Z}^{\top}=\boldsymbol{I},\quad \boldsymbol{Z}\boldsymbol{1}=\boldsymbol{0},$$

which is exactly equivalent to (19). Then, a solution can be obtained via applying Lemma 1.
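The SVD-based update described above can be sketched as follows; centering the input matrix before the SVD handles the zero-mean constraint, per our reading of Lemma 1 (the function name is ours):

```python
import numpy as np

def update_Z(M):
    """Closed-form Z-subproblem solution (Procrustes-type, cf. Lemma 1):
    minimize ||Z - M||_F^2  s.t.  (1/N) Z Z^T = I  and  Z 1 = 0.
    Centering M, then rescaling the SVD factors by sqrt(N), attains both
    constraints simultaneously."""
    K, N = M.shape
    Mc = M - M.mean(axis=1, keepdims=True)           # project out the mean
    U, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return np.sqrt(N) * U @ Vt

rng = np.random.default_rng(5)
K, N = 3, 200
M = rng.standard_normal((K, N))
Z = update_Z(M)
assert np.allclose(Z @ Z.T / N, np.eye(K), atol=1e-6)  # whitening constraint
assert np.allclose(Z @ np.ones(N), 0, atol=1e-6)       # zero-mean constraint
```

The zero-mean constraint comes for free here: the retained right singular vectors of the centered matrix are orthogonal to the all-one vector, so the resulting `Z` automatically has zero row means.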

The overall algorithm is summarized in Algorithm 1. One can see that the algorithm does not have computationally heavy updates. The most resource-consuming step is the SVD used for updating $\boldsymbol{Z}$. However, this step only takes $O(K^2N)$ flops, which is still linear in the number of samples. Note that $K$ is the number of shared components sought, which is often small in practice. A side note is that we also observe that one can update a block multiple times before switching to the next block—which often improves the convergence speed.

###### Remark 2

We should mention that although our technical part was developed under $Q=2$, the theorems and algorithm naturally hold when one has $Q>2$ views. Another remark is that many off-the-shelf tricks for speeding up training neural networks, e.g., adaptive step sizes and momentum [24], can also be employed to determine $\alpha_t$, which can normally accelerate convergence.

###### Remark 3

In the proposed approach, we use one neural network to parametrize each $f_m^{(q)}$ or $r_m^{(q)}$. Hence, $M_q$ individual networks are used to approximate $\boldsymbol{f}^{(q)}$ (cf. Fig. 1). This follows our PNL-based nonlinear signal model. In practice, if $M_q$ is large, such a parametrization could be costly in terms of both memory and computation. One workaround is to use a single fully connected, multiple-input-multiple-output (MIMO) neural network to approximate $\boldsymbol{f}^{(q)}$ (and the same applies to $\boldsymbol{r}^{(q)}$). This effectively means that all the individual mappings $f_m^{(q)}$ for $m=1,\ldots,M_q$ share neurons in the parametrization (as opposed to each using an individual network). Using a single fully connected network can reduce the computational burden substantially. It also sometimes outputs better results, perhaps because the associated optimization problem is better solved. Using a single MIMO network is a valid approximation for our model in (12), since the MIMO neural network can also approximate any MIMO continuous function in principle; see [42, Proposition 1], which is a simple extension of the universal approximation theory for multiple-input-single-output (MISO) functions [9].
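A minimal sketch of the single fully connected MIMO alternative (one shared network mapping all channels of a view jointly; all names and sizes here are ours):

```python
import numpy as np

def mimo_mlp(X, W1, b1, W2, b2):
    """A single fully connected MIMO network mapping all M channels jointly,
    as an alternative to M separate scalar networks."""
    H = np.tanh(W1 @ X + b1)      # X: (M, N) -> hidden activations (H, N)
    return W2 @ H + b2            # output: (M, N), one value per channel/sample

rng = np.random.default_rng(6)
M, H, N = 5, 16, 50
X = rng.standard_normal((M, N))
Y = mimo_mlp(X,
             rng.standard_normal((H, M)), rng.standard_normal((H, 1)),
             rng.standard_normal((M, H)), rng.standard_normal((M, 1)))
assert Y.shape == (M, N)
```

Note the trade-off: the shared network no longer enforces the channel-wise (diagonal) PNL structure exactly, but it cuts the parameter count and, as remarked above, can still approximate any continuous MIMO map in principle.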

### IV-D Connection to Deep CCA Approaches

An interesting observation is that, although starting from very different perspectives, the proposed formulation in (13) and the deep CCA line of work [2, 39, 5] end up with similar formulations. In particular, the deep canonically correlated autoencoder (DCCAE) [39] formulation is as follows:

$$\min\ -\frac{1}{N}\mathrm{Tr}\Big(\boldsymbol{Q}^{(1)\top}\boldsymbol{f}^{(1)}\big(\boldsymbol{X}^{(1)}\big)\boldsymbol{f}^{(2)}\big(\boldsymbol{X}^{(2)}\big)^{\top}\boldsymbol{Q}^{(2)}\Big)+\lambda\sum_{q=1}^{2}\Big\|\boldsymbol{X}^{(q)}-\boldsymbol{r}^{(q)}\big(\boldsymbol{f}^{(q)}(\boldsymbol{X}^{(q)})\big)\Big\|_F^2\quad\text{s.t.}\quad \frac{1}{N}\boldsymbol{Q}^{(q)\top}\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)\boldsymbol{f}^{(q)}\big(\boldsymbol{X}^{(q)}\big)^{\top}\boldsymbol{Q}^{(q)}=\boldsymbol{I}. \tag{20}$$

If one changes the second term in (13) from using to