# Visualizing High-Dimensional Data With Parallel Coordinates

The remarkable ability of the human brain to recognize patterns in graphical representations of data is the main reason that visual methods still play a crucial role in data-mining tasks. Humans are capable of detecting relevant as well as irrelevant structures in data and drawing the correct conclusions from them, provided the data is visualized in a suitable way. However, developing appropriate visualization tools is a challenging task, especially for high-dimensional data: because human spatial perception is limited to three dimensions, such data always has to be mapped to a lower-dimensional space. Parallel Coordinates (commonly ||-Coordinates), invented by Alfred Inselberg [1], follow such an approach by mapping high-dimensional data to 2D using a comprehensible methodology. In this blog post we briefly introduce ||-Coordinates and investigate their properties. Furthermore, we apply ||-Coordinates to various datasets and compare the results to several other visualization techniques.

Note: All R-scripts and other resources used for this blog post can be found on GitHub at https://github.com/MarkusThill/MultiDiminsionalViz

## Introduction

Many application areas nowadays generate large, high-dimensional datasets with unknown content. To gain insights and find the essential information hidden in the data, these datasets have to be explored systematically and carefully. Very often it is not clear at the beginning how, and with which tools, to approach unknown, complex data. Typically, the raw data is presented in textual or tabular form, which is hard for most people to analyze and makes it nearly impossible to reveal concealed structures. To get a first understanding of the data, visual representations can support the user in analyzing it and finding meaningful patterns. However, human perception is limited to a maximum of three dimensions, which makes it necessary to map data with more than three dimensions to a lower-dimensional space (commonly two dimensions) in order to present it graphically. The challenge of this mapping is to preserve as much of the structure that matters to the user as possible by choosing an appropriate visualization technique. Selecting a suitable tool is not always trivial: often the user can choose from a large group of candidates, which all give a different view of the data and have their individual advantages and disadvantages. Visualization techniques range from geometric methods such as the well-known scatter plots, through pixel-oriented techniques (e.g., dimensional stacking [2]), to icon-based methods (e.g., Chernoff faces [3]).

In this blog post we present a geometry-based visualization technique called Parallel Coordinates (in the following ||-Coordinates), which was especially designed to visualize multidimensional and multivariate data [1]. ||-Coordinates for two dimensions were already described in 1885 by Maurice d'Ocagne and independently rediscovered in 1959 by Alfred Inselberg, who was not aware of d'Ocagne's previous work ([4], p. 16). The main idea of ||-Coordinates is to transform high-dimensional datasets to two dimensions $(\mathbb{R}^N \rightarrow \mathbb{R}^2)$ by arranging the set of axes (one axis per dimension) parallel to each other and representing samples by polygonal lines. This 2-dimensional representation is well suited for visual exploration and pattern recognition and simplifies the search for relations in the data [5]. In practice, ||-Coordinates have been successfully applied in various fields, e.g. for automatic collision detection and avoidance algorithms in air traffic control (with three patents in the USA) or for data-mining and optimization tasks [6]. We will use ||-Coordinates intensively throughout this blog post by applying the technique to diverse datasets and exploring its specific characteristics.

## Fundamentals of Parallel Coordinates

### Construction and Properties

Consider an $N$-dimensional dataset containing $k$ observations. A ||-Coordinates graph visualizing this data is constructed as follows:

Analogous to the Cartesian coordinate system, every dimension requires its own axis, so a set of $N$ axes is created in total. All axes are placed parallel to each other in the plane, usually vertically and equidistantly. In detail, a ||-Coordinates graph can be viewed as a set of parallel $Y$-axes embedded in an $XY$-Cartesian coordinate system. For the $i$-th of the $N$ dimensions, the $Y$-axis is copied and placed at position $d_i$ on the $X$-axis, where $d_i$ is the distance of the $i$-th axis from the origin. Usually, the spacing between adjacent axes is chosen to be equal ($d_{i+1} - d_i = d$ for all $i$). Formally, every axis in ||-Coordinates is labeled $\overline{X}_i$. In total there are $N-1$ segments (the regions between two adjacent axes). With $N$ axes, there are $\frac{N(N-1)}{2}$ possible pairs of axes and $N!$ possible permutations for the arrangement of all $N$ axes.

A point $P = (x_1, x_2, \cdots, x_N)$ in the $N$-dimensional Cartesian coordinate system is mapped to a polygonal line in ||-Coordinates by simply connecting the $x_i$-values of the point $P$ on all corresponding axes $\overline{X}_i$ with straight lines. A set of $k$ points therefore leads to $k$ polygonal lines in ||-Coordinates [1] [7] [8].

Inselberg states that the representational complexity of ||-Coordinates is $O(N)$, since every additional dimension simply results in an extra axis [5]. In general this statement is true; however, as discussed later, the axis ordering in a ||-Coordinates graph plays an important role. In many cases several ||-Coordinates plots with different axis orderings have to be created and compared, which may indirectly increase the complexity.

There are no restrictions regarding the dimensionality of the data for Parallel Coordinates. Theoretically, ||-Coordinates can handle an arbitrary number of dimensions. For practical purposes we did not find information on a reasonable maximum number of dimensions; this usually depends on many factors, such as the size of the dataset and the screen resolution.
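The construction can be sketched in a few lines of Python. This is only an illustration, not the R implementation used for this post: the function names are hypothetical, the axes are placed at $x = 0, d, 2d, \ldots$ (0-based), and the min-max scaling of every attribute to $[0, 1]$ is an assumption made here so that all axes share one vertical range.

```python
def to_polyline(point, axis_spacing=1.0):
    """Map an N-dimensional point to the vertices of its polygonal line.

    Axis i (0-based) is placed at x = i * axis_spacing; the i-th coordinate
    of the point becomes the y-value on that axis.
    """
    return [(i * axis_spacing, value) for i, value in enumerate(point)]


def min_max_scale(rows):
    """Rescale every column to [0, 1] (an assumed, common normalisation)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Three made-up observations with N = 3 attributes.
data = [(1.0, 10.0, 100.0), (2.0, 30.0, 50.0), (3.0, 20.0, 0.0)]

# k observations lead to k polygonal lines with N vertices each.
polylines = [to_polyline(row) for row in min_max_scale(data)]
for line in polylines:
    print(line)
```

Each polyline has exactly $N$ vertices and $N-1$ segments, which is why an additional dimension only adds one more axis to the plot.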

### Examples

The following examples illustrate several properties of ||-Coordinates. For example, negatively correlated variables (in the example, points on a straight line with negative slope) are translated to lines in the ||-Coordinates graph that intersect each other.

### Point-Line Duality

Much of the theoretical work on ||-Coordinates has been done by Inselberg [8], including the point-line duality. At this point we only want to give a brief idea: the point-line duality describes the dual relation between points and lines, i.e., points defined in the Cartesian coordinate system are mapped to lines in the parallel-coordinates domain and vice versa. As mentioned before, a set of points in a Cartesian coordinate system is mapped to a set of polygonal lines in ||-Coordinates. For instance, a point $P = (p_1, p_2)$ in a 2-dimensional Cartesian coordinate system is mapped to a line that connects the point $y=p_1$ on the $\overline{X}_1$-axis with the point $y=p_2$ on the $\overline{X}_2$-axis. Assuming that $\overline{X}_1$ is placed at $x = 0$ and $\overline{X}_2$ at $x = d$, this line can be described by

$$
y = \frac{p_2 - p_1}{d} \cdot x + p_1 .
$$

It can easily be shown that a set of points sampled on a straight line $\ell$ in the Cartesian coordinate system, $P_i = (p_1, m \cdot p_1 + b)$, results in a set of lines $\overline{\ell_i}$ in ||-Coordinates that all intersect in one point $\overline{P_{s}}$. Inserting $p_2 = m \cdot p_1 + b$ into the line equation gives $y = p_1 \left( \frac{(m-1)\,x}{d} + 1 \right) + \frac{b\,x}{d}$, which no longer depends on $p_1$ at $x = \frac{d}{1-m}$. Hence

$$
\overline{P_{s}} = \left( \frac{d}{1-m},\; \frac{b}{1-m} \right), \qquad m \neq 1 .
$$

This intersection point corresponds to exactly one line in the $XY$-domain, which completes the point-line duality. As seen from the expression for $\overline{P_{s}}$, if the slope of the sampled line $\ell$ is $m<0$, the lines $\overline{\ell_i}$ in the ||-Coordinates system intersect between the axes $\overline{X}_1$ and $\overline{X}_2$. For $0<m<1$, the virtual extensions of all $\overline{\ell_i}$ intersect on the right-hand side of $\overline{X}_2$, and analogously for $m>1$ on the left-hand side of $\overline{X}_1$. The special case $m=1$ results in a set of parallel lines $\overline{\ell_i}$ that do not intersect in ||-Coordinates. Inselberg describes the underlying theory in much more detail in [4] (chapter 3).
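The duality can be checked numerically. The following sketch assumes that $\overline{X}_1$ sits at $x = 0$ and $\overline{X}_2$ at $x = d$; under this assumption, points sampled from $x_2 = m \cdot x_1 + b$ should produce lines that all meet at $\left(\frac{d}{1-m}, \frac{b}{1-m}\right)$. The helper names and the sample values are hypothetical.

```python
def dual_line(p1, p2, d=1.0):
    """Slope and intercept of the line joining (0, p1) and (d, p2)."""
    return (p2 - p1) / d, p1


def intersection(line_a, line_b):
    """Intersection point of two non-parallel lines given as (slope, intercept)."""
    (slope_a, icpt_a), (slope_b, icpt_b) = line_a, line_b
    x = (icpt_b - icpt_a) / (slope_a - slope_b)
    return x, slope_a * x + icpt_a


# Sample three points on the Cartesian line x2 = m * x1 + b (m < 0).
m, b, d = -2.0, 3.0, 1.0
lines = [dual_line(p1, m * p1 + b, d) for p1 in (0.0, 1.0, 2.5)]

# Every pair of dual lines should meet in the same point.
points = [intersection(lines[0], other) for other in lines[1:]]
expected = (d / (1 - m), b / (1 - m))

print(points)
print(expected)
```

Because $m < 0$ in this example, the common intersection lies between the two axes ($0 < x < d$); choosing $m > 1$ instead moves it to the left of $\overline{X}_1$.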

In general, ||-Coordinates can be applied to a wide range of high-dimensional visualization problems in very different application areas. Maybe the most important purpose of ||-Coordinates is simply to give a general overview of the visualized data. ||-Coordinates can provide a lot of information at first glance: outliers can be identified easily, as well as other anomalies or patterns. This first summary can help users proceed with their data-mining tasks and give a direction for further work on the data. Cluster analysis, the task of finding sets of samples with a similar structure, can also be done with the help of ||-Coordinates. Trivial clusters can already be found by looking at the individual axes; clusters in higher dimensions can be identified based on characteristics such as density, proximity and slopes of the polygonal lines, intersections of lines with the axes, and more. Often, visual clustering in ||-Coordinates is combined with the search for correlations in the data. ||-Coordinates are able to reveal many different types of correlation; e.g., a strong negative linear correlation results in a set of lines that intersect in one point between the two corresponding axes. Furthermore, ||-Coordinates can support the user in classification and, to some extent, in regression tasks [7]. For instance, samples of different classes can be colored in order to compare the characteristics of the classes and to find the attributes that separate the classes best for further processing (finding a subset of attributes that describes the data best is not necessarily limited to classification tasks, but can be a general task as well). ||-Coordinates could also support the evaluation or development of classifiers, for example interactively in combination with algorithmic classifiers, as proposed in [9] [10]. In some cases, the user may use ||-Coordinates simply for verification purposes (perhaps in combination with one or more of the previous tasks) or as a reporting tool without the intention of knowledge discovery.

### Common Issues

Although ||-Coordinates are very often a good tool for visualizing high-dimensional data, there are some common issues that one should be aware of. Overplotting (often denoted as visual clutter) is one of the main problems that users may face. Due to the large number of records in a dataset, the vast number of lines overlap and can cover important patterns in the plot [7]. There are different approaches to handle visual clutter in ||-Coordinates. A possible solution is to (randomly) sample the data to a certain degree, which often preserves the main patterns in the plot and makes them visible. However, this approach is associated with a loss of information and can lead to a misinterpretation of the data. Different sampling techniques are described in [11]. Further clutter-reduction techniques are brushing, density-based methods, aggregation, and axis reconfiguration. Brushing techniques allow the user to (interactively) highlight selected observations and to hide others. Density-based methods, such as alpha blending (which uses opacity to make dense regions appear more intense), are commonly applied to ||-Coordinates. Aggregating samples in ||-Coordinates, e.g., by clustering, can reduce the amount of visual clutter, as can axis reordering and axis inversion. A summary of further clutter-reduction techniques can be found in [7].

The axis ordering of ||-Coordinates is an important issue for revealing patterns in the data. Inexperienced users often only find structures between adjacent axes, and even experienced users normally need to evaluate different axis constellations in order to find structures; usually, the ordering of the axes has a large effect on the appearance of ||-Coordinates. However, selecting a suitable axis ordering is not trivial. Analyzing all possible permutations of the axes may be possible for low dimensions, but with $N!$ permutations (for $N$ variables) this task gets more and more difficult with increasing dimensionality. Since only the pairwise relations have to be analyzed, the number of necessary permutations can be reduced tremendously. The ||-Coordinates Matrix (PCM) [12] is one approach addressing this problem by showing all pairwise relations with comparably few ||-Coordinates plots, as discussed in the following section. Another solution for the axis-ordering problem could involve automated analysis methods, which rank a set of ||-Coordinates candidates according to certain measures and only offer a small selection of these to the user, as proposed in [13].

A further problem that frequently occurs in ||-Coordinates (especially with discrete axes) is the intersection of two or more lines at exactly the same position on an axis. This makes it impossible to follow the affected polygonal lines (there is more than one possibility for how a line could continue) and therefore to distinguish the involved samples. A simple solution for this problem could be the use of different colors, which may work for small datasets but becomes impractical for more complex data. Other approaches use curves instead of lines or interactively highlight certain samples [7].
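As a rough illustration of this reduction (a back-of-the-envelope count, not a result quoted from [12]): a single plot with $N$ distinct axes shows $N-1$ of the $\frac{N(N-1)}{2}$ axis pairs as neighbours, so covering every pair at least once requires at least

$$
\left\lceil \frac{N(N-1)/2}{N-1} \right\rceil = \left\lceil \frac{N}{2} \right\rceil
$$

plots. For $N = 12$, for example, this means $66$ pairs and at least $6$ plots, compared to $12! = 479{,}001{,}600$ possible axis orderings.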

Since ||-Coordinates were introduced by Inselberg, a lot of further work has been done: extending ||-Coordinates, targeting certain problems (e.g., overplotting), combining ||-Coordinates with other visualization techniques, specializing the technique for certain tasks such as clustering, and more. Hybrid visualization techniques, which combine two or more methods, are often designed with the goal of utilizing the advantages of the integrated methods in order to compensate for their disadvantages. Many examples of such hybrid techniques using ||-Coordinates can be found. In [14], scatter plots and multidimensional scaling are integrated into ||-Coordinates. Fanea et al. combine star glyphs with ||-Coordinates, addressing the clutter problem in ||-Coordinates. In [15], ||-Coordinates are coupled with RadViz (a radial visualization technique), using RadViz for brushing, clustering, and coloring of the observations and ||-Coordinates for visualizing the quantitative information of the data. A couple of publications extend the idea of ||-Coordinates to three dimensions instead of two; for example, Rübel et al. [16] use planes (basically representing scatter plots) instead of axes and connect the corresponding points on all planes with polygonal lines. In [17], isosurfaces based on the densities in the ||-Coordinates plot are generated, which give the user a better feeling for the regions in the plot with higher concentrations of samples. Dang et al. [18] follow a similar idea in order to overcome the overplotting problem by stacking overlapping elements. Typically, ||-Coordinates are defined for a set of samples or, in other words, discrete data points. In [19], ||-Coordinates were extended to data in the continuous domain by using a density model.

As mentioned before, the ||-Coordinates-Matrix (PCM) [12] was introduced with the aim of finding appropriate axis orderings with a preferably small number of ||-Coordinates plots. The method is based on graph-theoretical considerations: every axis represents a vertex in a graph. The graph is considered to be complete (describing all possible pairwise relations between the individual axes). By applying the Hamiltonian decomposition to the graph, it is split into a set of Hamiltonian paths (for even $N$) or Hamiltonian cycles (for odd $N$). A Hamiltonian path visits every vertex exactly once. Hamiltonian cycles are Hamiltonian paths that form a cycle; in the axis-ordering problem they can be retrieved by calculating the Hamiltonian paths for $N-1$ vertices and adding the remaining dimension to the beginning and end of every path. This approach, using the PCM, reduces the number of ||-Coordinates plots needed to visualize all axis combinations to $\lfloor \frac{N}{2} \rfloor$. We developed an implementation of the PCM in R [20] that determines Hamiltonian paths and cycles and displays the corresponding ||-Coordinates graphs in one plot. The ||-Coordinates Matrix method will be used intensively for our analysis in later sections.
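One way to obtain such axis orderings is the classic zigzag construction, sketched below in Python. This is an illustrative assumption: it is not necessarily the construction used in [12] or in our R implementation, and all function names are hypothetical. For even $N$ it yields $N/2$ Hamiltonian paths; for odd $N$ it builds the paths for $N-1$ axes and puts the remaining axis at both ends of every path, as described above.

```python
from itertools import combinations


def hamiltonian_paths(n):
    """Split the complete graph on an even number n of vertices into n/2
    edge-disjoint Hamiltonian paths (zigzag construction)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be even and at least 2")
    paths = []
    for start in range(n // 2):
        path = [start]
        for step in range(1, n):
            offset = (step + 1) // 2           # 1, 1, 2, 2, 3, ...
            sign = 1 if step % 2 == 1 else -1  # +, -, +, -, ...
            path.append((start + sign * offset) % n)
        paths.append(path)
    return paths


def axis_orderings(n):
    """Axis orderings in which every pair of n axes is adjacent exactly once."""
    if n % 2 == 0:
        return hamiltonian_paths(n)
    # Odd n: paths on the first n - 1 axes, closed into cycles by placing
    # the remaining axis (index n - 1) at the beginning and the end.
    return [[n - 1] + path + [n - 1] for path in hamiltonian_paths(n - 1)]


def adjacent_pairs(orderings):
    """Sorted list of all unordered axis pairs that are neighbours somewhere."""
    return sorted(
        tuple(sorted(pair))
        for order in orderings
        for pair in zip(order, order[1:])
    )


for n in (4, 5, 12):
    orderings = axis_orderings(n)
    # Every pair of axes must appear as neighbours exactly once.
    assert adjacent_pairs(orderings) == sorted(combinations(range(n), 2))
    print(n, "axes:", len(orderings), "plots")
```

The check inside the loop confirms that the orderings cover each of the $\frac{N(N-1)}{2}$ pairs exactly once while using only $\lfloor \frac{N}{2} \rfloor$ plots.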

## Applications and Experimental Setup

In this section we introduce a few datasets that will be used to explore the features and limits of ||-Coordinates. The datasets vary in the number of dimensions and observations, and in the data types of the attributes. The Out5d and the Pollen dataset have the smallest number of dimensions (both five); the MiniBooNE dataset has the highest dimensionality (51 columns, including the class label). Some datasets will be used for classification purposes, others for clustering, or simply for finding patterns, correlations, and outliers. In the following, we will apply ||-Coordinates to the individual datasets and analyze the results. For this purpose, we developed scripts realizing ||-Coordinates plots and ||-Coordinates-Matrix plots (based on the Hamiltonian decomposition described in [12]) using R [20] and the R package ggplot2. R already provides built-in ||-Coordinates methods, such as ggpcp; however, their range of functionality was not sufficient for our purposes. In our ||-Coordinates, we will use brushing (e.g., for highlighting clusters), coloring of samples (based on their class or other features), axis inversion (which helps to reduce clutter, especially for negatively correlated segments), data-scaling functionalities and, very intensively, the ||-Coordinates-Matrix.

### The Pollen Dataset

The pollen dataset describes geometric features of pollen grains [21]. The dataset was assembled by David Coleman of RCA Labs in Princeton, USA in 1986 and was used as the American Statistical Association (ASA) Exposition dataset for a competition. In total the dataset contains 3848 observations on 5 variables (ridge, nub, crack, weight, density). The first three variables represent the lengths of geometric features observed from sampled pollen grain: a “ridge” for the $x$-dimension, a “nub” along the $y$-axis and a “crack” in the $z$-direction. The fourth variable describes the pollen grain weight and the last dimension represents the density.

Since the number of dimensions is rather small and the dataset contains some interesting patterns, we will use this dataset to briefly describe certain features of ||-Coordinates.

### The Out5d Dataset

The 5-dimensional Out5d dataset contains 16384 observations of remotely sensed data, collected from a western region of Australia for the Worcester Polytechnic Institute in 2005 by Peter Ketelaar [22]. Unfortunately, the author of this dataset does not provide much general information on it; we could only find the following remarks: the measurements were performed on a $128 \times 128$ grid, which leads to a total of 16384 records with radiometric information on each grid cell. The five dimensions are spot, magnetics and three bands of radiometrics: potassium, thorium and uranium.

The Out5d dataset was used in several publications, which apply different methods to the data in order to reveal patterns. Among others, we found [23], [15], [24], [25], [26], [27], [28], [29], [30], which all explore the Out5d data. We will compare our results with a few of the papers mentioned here.

### The Wine Quality Dataset

The wine quality dataset contains a total of $6497$ samples with $12$ attributes of red (1599 samples) and white (4898 samples) Vinho Verde wines from the north-western region Minho of Portugal, collected from May 2004 to February 2007 (the dataset was retrieved from [31]). Eleven of the twelve attributes were determined by objective physicochemical laboratory tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The last attribute, quality, is a subjective measure in the range of 0 (very bad) to 10 (excellent), which represents the preferences of three certified wine tasters who graded the wines. Details can be found in [32]. Similar to [32], we will try to use this dataset to find some general potential relationships between the objective physicochemical measures and the wine-taste preferences and, additionally, explore the main differences between the red and white wines from the Vinho Verde region. Finding such relations in the data could make it possible to control some of the objective variables in a purposeful way to increase the quality of the corresponding wine. While Cortez et al. [32] mainly apply Support Vector Machines, multiple regression and neural network methods to the dataset, we will simply use ||-Coordinates as the main technique in order to find some trivial patterns in the data. Besides [32], we found additional publications, [33], [34], [35], that use the wine quality dataset, mainly for regression or classification purposes.

### MiniBooNE Particle Identification Dataset

The MiniBooNE dataset (retrieved from [31]) was generated for the MiniBooNE experiment performed at Fermilab in Batavia, Illinois, with the goal of finding neutrino oscillations [36]. The dataset contains 130065 samples that describe, with a total of 50 attributes, either electron neutrinos or muon neutrinos. In the following we will refer to electron neutrinos as signal events and to muon neutrinos as background events. The MiniBooNE dataset is the largest dataset, in both the number of observations and the number of dimensions, that we will use for this blog post. In one of the following sections, we will apply ||-Coordinates to the dataset and try to perform some simple visual classifications on the data, attempting to distinguish between signal events and background events. We could not find publications visualizing this dataset, nor much information on its attributes, and therefore simply label them from V1 to V51, where V51 describes the class of the observation.

## Analysis

### Introductory Example: The Pollen Dataset

In this section, we will explore the pollen dataset, mainly using ||-Coordinates.

To get a first impression of the dataset we simply create a ||-Coordinates plot of all observations without any special adjustments or parameter-settings as seen in Fig. 3. Because of the high number of observations the plot has a high degree of overplotting. Nevertheless it is possible to already get a few insights into the data.

Due to the high degree of overplotting in the classical ||-Coordinates plot, we apply alpha blending (a density-based approach, which uses opacity to highlight regions with higher density) to the ||-Coordinates, with $\alpha = 0.05$. The corresponding plot is displayed in Fig. 4.
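To get a feeling for what a given $\alpha$ means, the following Python sketch computes how many overlapping lines are needed before a region looks (almost) solid. It assumes the usual "over" compositing of equally colored, semi-transparent lines, where $k$ overlapping lines yield an opacity of $1 - (1-\alpha)^k$; the actual rendering depends on the graphics device, and the function names are hypothetical.

```python
import math


def stacked_opacity(alpha, k):
    """Opacity of a spot where k lines of opacity alpha overlap."""
    return 1.0 - (1.0 - alpha) ** k


def lines_for_opacity(alpha, target):
    """Smallest number of overlapping lines whose opacity reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))


for alpha in (0.05, 0.01):
    print(
        f"alpha={alpha}: half opaque after {lines_for_opacity(alpha, 0.5)} lines, "
        f"99% opaque after {lines_for_opacity(alpha, 0.99)} lines"
    )
```

Smaller values of $\alpha$ therefore keep dense regions distinguishable for longer, at the price of making sparse regions and single outliers nearly invisible.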

The data does not seem to contain significant outliers. All observations appear to be centered around zero for the 5 given attributes. The hyperbolic envelopes in the plot seem to be a strong indication for normally distributed data and indeed, a density plot strengthens this hypothesis. All attributes are roughly normally distributed, with differing standard deviations. However, the density plot apparently contains an anomaly around the mean value of every attribute.

Also, the ||-Coordinates plot shows a small region with higher density around the value zero that is unusual for normally distributed data and could explain the anomaly in the density plot. A cluster with this appearance is not very likely in normally distributed data. We therefore take a closer look, zoom into the plot and try to extract this cluster.

In figures 6 and 7, we can clearly identify a certain structure in the cluster found before. This structure does not appear to be random and should be examined in more detail. From the ||-Coordinates it can already be seen that a few trivial patterns – such as linear correlations – are included in this structure. For example, the attributes ridge and nub appear to be positively correlated, while nub and crack, crack and weight, and weight and density are negatively correlated on the other hand. Permuting the axis-ordering using the ||-Coordinates-Matrix shows that all attributes are linearly correlated.

By creating a scatter plot matrix of the extracted cluster, we can finally reveal the hidden information in the dataset. The points in many of the subplots form the word “EUREKA”. As an example, the scatter plot for the attribute weight against density is displayed in Figure 8.

The complete dataset was generated artificially, and the attribute names as well as the name of the dataset itself were only chosen to mislead the participants of the competition into thinking that the data was related to biology.

### The Out5d Dataset

In this section, we will apply ||-Coordinates to the Out5d dataset, mainly to find patterns and correlations in the data. We first simply generate a ||-Coordinates plot without any further adjustments or parameters.

Due to the rather high number of observations (16384 in total), it is hardly possible to find any information in the plot; only very basic characteristics can be seen. The dataset does not contain negative values. The maximum values for the first three axes (spot, magnetics and potassium) are around 250; in fact, the summary command in R shows that the maximum value is 255, which indicates that byte variables were used to represent the dimensions. Furthermore, no significant outliers can be seen in this plot. The higher-valued regions on the magnetics and potassium axes could suggest that a negative correlation is present for the corresponding observations. However, a general statement is not possible at this stage due to the overplotting in the ||-Coordinates plot.

We try to reduce the clutter by applying alpha-blending to the plot.

A value of $\alpha=0.01$ is chosen, which allows us to see patterns in the dataset for the first time. However, a certain degree of clutter still remains. One method to reduce visible clutter in ||-Coordinates plots is the inversion of individual axes in order to decrease the number of crossing lines, especially when negative correlations are present. We invert the magnetics axis, expecting to reduce the clutter slightly more and to get a different view of the Out5d data. Fig. 11 shows the result of the inversion.

On the magnetics axis we can see many observations with a value higher than 250 and lower than 100. Also for potassium, a large number of records reach the maximum value of 255. As already assumed before, we can see a somewhat negative correlation between magnetics and potassium. High magnetics values seem to correspond to low spot values; on the other hand, low values on the magnetics axis spread evenly over the whole spot axis. Altogether, it seems possible to separate records with very high and very low magnetics in the other dimensions potassium, thorium and uranium as well.

To be more precise (according to the figure above): a negative correlation between magnetics and potassium is visible, but it is not symmetric. High potassium values correspond to small magnetics values, whereas large magnetics values correspond to comparably low potassium values that are spread over a wider range. Furthermore, a low amount of potassium does not necessarily imply high magnetics. In the area around 100 (real value: 155) on the inverted magnetics axis we recognize another higher concentration of observations. We count around 1300 records for magnetics in the range $[155, 170]$ (in comparison, the range $[135, 150]$ only contains approx. 340 records). In Fig. 12 these observations, and two other sets of observations, are highlighted by brushing. With the exception of the attribute spot, most of the highlighted black-colored observations can also be clustered in the other 4 remaining dimensions. Also, the samples in blue form a fairly dense cluster in the plot.
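Axis inversion and range brushing are both simple operations, sketched here in Python. The sketch assumes that an axis is inverted by reflecting values within their range (value $\mapsto$ min + max − value), which is consistent with the real value 155 appearing at position 100 on an axis spanning 0 to 255. The function names and the sample readings are made up and not taken from the Out5d data.

```python
def invert_axis(value, lo, hi):
    """Reflect a value within [lo, hi], so that lo maps to hi and vice versa."""
    return lo + hi - value


def brush(values, lo, hi):
    """Return the indices of all values inside the closed range [lo, hi]."""
    return [i for i, v in enumerate(values) if lo <= v <= hi]


# Made-up magnetics readings for illustration only.
magnetics = [12, 140, 155, 160, 170, 255]

# Positions of the readings on the inverted axis.
inverted = [invert_axis(v, 0, 255) for v in magnetics]

# Brushing operates on the real values, independent of how the axis is drawn.
selected = brush(magnetics, 155, 170)

print(inverted)
print(selected, len(selected))
```

Counting the brushed indices, as done in the last line, is how record counts for a value range can be obtained.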

Another approach to get further insights into the data is the so-called ||-Coordinates-Matrix (PCM). The ||-Coordinates-Matrix displays all pairwise relations between the attributes and can easily be created for datasets with only a few dimensions; for the Out5d dataset only two ||-Coordinates plots are needed. Because of the odd number of attributes, we construct the Hamiltonian cycles by first calculating the Hamiltonian paths for the complete graph with $N=4$ vertices and then adding the remaining fifth attribute to the beginning and end of every row in the Hamiltonian matrix. In total, two such Hamiltonian cycles are possible:

Using this matrix it is possible to show all pairwise relations in just two ||-Coordinates plots.

The ||-Coordinates-Matrix reveals a few more interesting patterns in the Out5d dataset: in the pairwise relation between uranium and magnetics, we can observe a couple of features that could not be seen that clearly before. Another relation that was not seen before is the one between potassium and uranium.

Most of the clusters found for the Out5d dataset in [27] (which, however, does not cover all possible combinations of attributes) can also easily be seen in our ||-Coordinates-Matrix. In [23], Malik and Ünlü use Parallel Coordinates and describe a strong negative correlation between magnetics and potassium, which was also found in this blog post. In [15], ||-Coordinates are combined with RadViz (Radial Visualization) in order to simplify the clustering process. The authors find that low values of thorium, potassium and uranium correspond to high magnetics values. In addition, a cluster with high uranium and potassium values and low magnetics values was observed. A closer look at our ||-Coordinates also leads to this conclusion.

Although the Out5d dataset was used simply to illustrate certain features of ||-Coordinates in this example, the structure of the data indicates “potentially anomalous data”, as also stated in [23]: both attributes, magnetics and potassium, have around 20% of all observations located at the maximum value of 255. This is not necessarily problematic, but since only little information on the data is available, these records should be treated carefully.

In every row of the ||-Coordinates-Matrix, the attribute uranium is used twice: at the beginning and at the end of the row.

Beyond the comparison above, Malik and Ünlü [23] also find single outliers (in more than one dimension), and the clusters reported in [15] can likewise be recognized in the ||-Coordinates-Matrix in Fig. 13. For more results, refer to the previously mentioned citations.

### The Wine Quality Dataset

Analogous to the previous sections, we will now explore the wine quality dataset with the help of ||-Coordinates plots. The major goal is to find patterns in the data related to the quality of both red and white wines by visualizing the data in an appropriate way. Furthermore, we want to discover the main differences between red and white wines. Given the large number of samples, plotting the whole dataset (even separately for white and red wines) using a plain ||-Coordinates plot is not advisable, because the degree of overplotting is simply too large. Alpha blending is again our first step towards a usable plot. Nevertheless, the ordering of the columns is not optimal, so the amount of clutter in the plot is still rather large. By finding suitable orderings using the ||-Coordinates-Matrix (PCM) and inverting a few axes, it is possible to reduce the clutter a little more. In total, six ||-Coordinates plots are necessary for the PCM. Among all six candidates, we try to find the constellations that reduce clutter the most and make clusters visible.
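The figure of six plots matches a simple counting argument: with 12 axes (11 objective attributes plus quality) there are 66 attribute pairs, and one plot shows 11 neighbouring pairs. The following Python sketch generates such a set of orderings with a zigzag pattern. It is only an illustration, is not taken from the R scripts of this post, and it is an assumption that the ||-Coordinates-Matrix used here is built in this way; the function names are made up for the example, and only an even number of axes is handled.

```python
from itertools import combinations


def zigzag_orderings(n_axes):
    """Return n_axes/2 axis orderings in which every pair of axes is adjacent once."""
    if n_axes < 2 or n_axes % 2 != 0:
        raise ValueError("this sketch only handles an even number of axes")
    orderings = []
    for start in range(n_axes // 2):
        order = [start]
        for step in range(1, n_axes):
            offset = (step + 1) // 2      # 1, 1, 2, 2, 3, 3, ...
            sign = 1 if step % 2 else -1  # +, -, +, -, ...
            order.append((start + sign * offset) % n_axes)
        orderings.append(order)
    return orderings


def adjacent_pairs(orderings):
    """Collect every unordered pair of neighbouring axes over all orderings."""
    pairs = []
    for order in orderings:
        pairs.extend(frozenset(p) for p in zip(order, order[1:]))
    return pairs


# 11 objective attributes plus the quality axis give 12 axes.
plots = zigzag_orderings(12)
pairs = adjacent_pairs(plots)
all_pairs = {frozenset(p) for p in combinations(range(12), 2)}
print(len(plots), len(pairs), set(pairs) == all_pairs)  # 6 66 True
```

The axis indices would have to be mapped to attribute names before plotting; which of the six orderings is the most readable still has to be judged visually.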

Additionally, extreme outliers are removed from the dataset to achieve a better overall distribution of the lines in the plot. In total, 31 samples are removed from the white wine dataset and 17 from the red wine dataset; however, only outliers that are not part of the highest and lowest wine-quality groups are removed. The result for the white wine dataset can be seen in figure 14: it gives us a first overview of the data with its value ranges, clusters and other patterns. The red wine dataset can be handled in a similar manner. However, it is still not possible to make many statements regarding the wine quality. Only very few white wines were graded with a quality of 9, and no red wines at all. The number of bad wines is fairly small as well; the worst red and white wines get a rating of 3.

In order to get a better understanding of the influence of the objective attributes on the wine quality, we decided to implement a coloring scheme for our R implementation of the ||-Coordinates plot. The color of each sample is chosen according to its assigned quality using a heat-based coloring model: records with a bad quality are colored yellow, whereas high-quality wines are colored red. This approach should make it possible to visualize general quality-related patterns.
Again, we select an appropriate axis setting using the ||-Coordinates-Matrix and axis inversion. Simply comparing the best and worst wines in a ||-Coordinates plot, by brushing the corresponding subsets, can also help to find general correlations.
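A heat-based scheme of this kind can be sketched in a few lines. The Python snippet below is not the R code used for the figures; it assumes a linear interpolation between pure yellow and pure red and uses the quality range 3 to 9 mentioned above as default limits. The function name and the exact color endpoints are assumptions for illustration.

```python
def quality_to_hex(quality, q_min=3, q_max=9):
    """Map a quality score to a hex colour between yellow (low) and red (high)."""
    if q_max <= q_min:
        raise ValueError("q_max must be larger than q_min")
    # Clamp the score, then rescale it to the interval [0, 1].
    t = (min(max(quality, q_min), q_max) - q_min) / (q_max - q_min)
    # Yellow is (255, 255, 0) and red is (255, 0, 0): only green changes.
    green = round(255 * (1 - t))
    return "#{:02X}{:02X}{:02X}".format(255, green, 0)


for q in (3, 6, 9):
    print(q, quality_to_hex(q))
```

Because the rare best and worst wines are the most interesting ones, it can help to draw them last so that they are not hidden under the many medium-quality lines.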

The attributes that appear to have a stronger effect on the wine quality are placed closer to the quality axis. The corresponding ||-Coordinates plot for the white wines is shown in the figure below.

We found the attributes alcohol, density and volatile acidity (in descending order) to have the highest importance for the white wine dataset. In general, white wines with a higher alcohol content seem to have a higher quality. In fact, the average alcohol concentration for wines with a quality greater than or equal to 7 is around $11.5\%$; for qualities less than or equal to 5, the concentration is just around $9.8\%$, which is a considerable difference given the total range of alcohol ($[8\%, 14.2\%]$). A second important variable is the density of the white wines. Low densities (note that the density axis is inverted in figure 16) indicate higher qualities (as well as higher alcohol concentrations) and vice versa. Furthermore, low values of the attribute volatile acidity also correlate with higher qualities, even though the correlation is not as clear as for alcohol and density.
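To put the difference between the two group means into perspective, it can be related to the total range of the alcohol axis. Using only the rounded values quoted above:

$$\frac{11.5 - 9.8}{14.2 - 8.0} = \frac{1.7}{6.2} \approx 0.27$$

The two group means are therefore separated by roughly a quarter of the axis length in the plot. This is only a rough indication based on rounded numbers and says nothing about the spread within each group.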

The ||-Coordinates plot also shows some other interesting information. For example, although sulphates does not appear to be generally useful for our purposes, in the higher-value regions ($>0.75 g/dm^3$) we mostly find samples with higher qualities. Furthermore, a closer look at the segment between pH and fixed acidity reveals the negative correlation between both attributes: a higher fixed acidity leads to a lower pH value, which seems reasonable (note again that the axis for fixed acidity was inverted). The same effect, even though not as clear, can be observed for citric acid. In contrast to our results, Cortez et al. [32] find the attribute sulphates to be the most important one, followed by alcohol, residual sugar and citric acid. The most likely reason for this is that their work used Support Vector Machines for regression, which presumably rely on different importance measures than visual approaches. In Nachev and Stoyanov's work [33], the three most important variables (determined by the symmetrical uncertainty importance ranking) are alcohol, density and chlorides. A Random Forest built by us suggests alcohol, density and volatile acidity as the most important variables.

For the red wine data (figure 17), alcohol again appears to have the highest impact on the wine quality: the wine tasters gave better grades to wines with a higher alcohol concentration. In contrast to the white wines, however, density is not as useful for drawing conclusions regarding the red wine quality, whereas sulphates seems more suitable. Another observation is that especially very badly graded red wines have a low citric acid content and a high volatile acidity, and vice versa. For the red wine dataset, we also find the negative correlation between pH and fixed acidity or, alternatively, citric acid.
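Negative correlations such as the one between pH and fixed acidity are exactly the case in which axis inversion reduces clutter: between two neighbouring axes, negatively related attributes produce many line crossings, and flipping one of the axes removes them. The Python sketch below illustrates this by counting crossings. It is not part of the post's R scripts, the helper names are made up, and the values are invented to be perfectly negatively related; they are not taken from the wine data.

```python
from itertools import combinations


def min_max(values):
    """Rescale values to [0, 1], as is usual for a parallel-coordinates axis."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def invert(axis_values):
    """Flip a normalised axis so that large values are drawn at the bottom."""
    return [1.0 - v for v in axis_values]


def count_crossings(left, right):
    """Count pairs of lines that cross between two neighbouring axes."""
    crossings = 0
    for i, j in combinations(range(len(left)), 2):
        # Two lines cross if their order on the left axis is reversed on the right.
        if (left[i] - left[j]) * (right[i] - right[j]) < 0:
            crossings += 1
    return crossings


# Made-up, perfectly negatively related values (not taken from the wine data).
fixed_acidity = min_max([6.0, 7.0, 8.0, 9.0])
ph = min_max([3.6, 3.4, 3.2, 3.0])

print(count_crossings(fixed_acidity, ph))          # 6
print(count_crossings(invert(fixed_acidity), ph))  # 0
```

On real data the crossings will not vanish completely, but comparing the count before and after the flip gives a quick hint whether inverting an axis is worthwhile.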

Finally, as the last step in this section, we want to investigate the main differences between the red and white wines. The ||-Coordinates plot comparing both types of wine can be found in figure 18. The differences between red and white wines are easily recognizable. For example, the amount of total sulfur dioxide is rather low for red wines in comparison to white wines. The values for sugar and free sulfur dioxide are much lower for red wines as well, and differences are also observable in most other parts of the plot. Based on our visual impression, the attributes total sulfur dioxide, sugar and free sulfur dioxide are the most suitable (important) ones for classifying the two wine types.

### The MiniBooNe Particle Identification Dataset

The MiniBooNe dataset is the last dataset we want to analyze with Parallel Coordinates. It has the largest number of observations (130,065) and dimensions (50, excluding the class information), which makes handling this data rather challenging, both for the applied techniques and for the computer hardware. Our R implementation of ||-Coordinates has difficulties handling this large number of records; the creation of a single plot takes too much time. Therefore, we decided to sample the dataset randomly in order to reduce the number of observations. Unless otherwise stated, a sample size of 20,000 observations is chosen. The MiniBooNe dataset contains 36,499 signal events and 93,565 background events, which makes it unbalanced. Sampling uniformly from the whole dataset would again lead to an unbalanced subset; whether this is desired has to be decided from case to case. We want to weight both classes equally, so we randomly pick 10,000 instances from each class.
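The balanced sampling step can be sketched as follows. The Python snippet is an illustration and not the R code used for this post; the function name, the fixed seed and the stand-in record ids are assumptions, while the class sizes and the 10,000 records per class are the numbers quoted above.

```python
import random


def balanced_sample(records, labels, per_class, seed=0):
    """Draw the same number of records from every class without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    sample = []
    for label, members in by_class.items():
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} records")
        sample.extend((record, label) for record in rng.sample(members, per_class))
    rng.shuffle(sample)  # mix the classes so the drawing order favours neither
    return sample


# Stand-in record ids with the class sizes quoted in the text.
labels = ["signal"] * 36499 + ["background"] * 93565
records = list(range(len(labels)))
subset = balanced_sample(records, labels, per_class=10000)
print(len(subset))  # 20000
```

Shuffling the result matters for plotting: if all signal lines were drawn first, the background lines drawn afterwards would cover them in dense regions.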

A first ||-Coordinates plot of the selected 20,000 observations shows that many attributes have a minimum value of $-999$. In total there are 468 such records, which we remove from the dataset. Additionally, we automatically remove outliers that prevent a good view of the ||-Coordinates plot, in total not more than 1% of the selected 20,000 observations. Another plot of the dataset, containing all 50 attributes, can be seen in figure 20. Although the ||-Coordinates plot contains many observations in a high-dimensional space, it is already possible to find many patterns in the dataset. We get a first overview of the value ranges of all attributes, many of which appear to be normally distributed. If we take a closer look at the segment between V38 and V39, we can identify the typical hyperbolic structure that is formed by two normally distributed variables. Other patterns can be found as well: for example, attributes such as V1 and V17 seem to separate both classes quite clearly, while V34, V41 and V50 do not appear to have suitable class-separating properties. There are also regions in the plot where the data is widely spread and other regions that contain clusters. Signal events are widely spread between V15 and V17, whereas background events are rather dense. Many correlations between individual neighboring variables can be found as well: V16 and V17 seem to be positively correlated, whereas V31 and V32 are negatively correlated.
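Records containing the value $-999$ stretch the affected axes so much that the remaining observations are squeezed into a small part of the plot, which is why they are removed first. A minimal Python sketch of this filtering step is shown below; it is not the post's R code, the function name is made up, and the rows are invented placeholders rather than MiniBooNe records. The automatic outlier removal mentioned above is not covered, since its rule is not specified here.

```python
MISSING = -999  # sentinel value reported in the text


def drop_sentinel_rows(rows, sentinel=MISSING):
    """Keep only rows in which no attribute equals the sentinel value."""
    kept = [row for row in rows if all(value != sentinel for value in row)]
    return kept, len(rows) - len(kept)


# Tiny made-up rows standing in for the sampled observations.
rows = [
    [2.6, 0.7, 127.3],
    [-999, -999, -999],
    [3.1, 1.2, 98.4],
    [4.0, -999, 110.0],
]
clean, n_removed = drop_sentinel_rows(rows)
print(len(clean), n_removed)  # 2 2
```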

As mentioned before, not all 50 attributes appear to be suitable for (visual) classification purposes. In the following, we try to find the most significant variables and focus on them. Figure 21 shows a ||-Coordinates plot containing only those attributes which we, after intensive analysis and tests (axis reordering, axis inversion), consider suitable for classifying signal events and background events. In total, 13 attributes are selected, and the most suitable ||-Coordinates plot is determined using the ||-Coordinates-Matrix. A first subjective view of the plot suggests that variable V1 has the best class-separating property, followed by V17 and V16, which are in fact ranked very high in a Random Forest as well (all three belong to the 5 most important variables). V3 also seems to be a suitable attribute for class separation, even though most samples are concentrated in low-value regions of the axis.

Although ||-Coordinates is mainly a visualization technique and not really suitable for actual classification purposes, the human capability of finding relations and other patterns in visualized data can make ||-Coordinates an important tool to evaluate, adjust and verify algorithmic classification approaches. It is even conceivable to use the geometric features of ||-Coordinates, such as the slope of the polygonal lines or their proximity, for the design and support of classification algorithms. Adding such geometric features as further decision variables to the classification process, for example in decision trees, could help to improve classification algorithms (e.g., the slope of a polygonal line in one or more segments could be more suitable as a class-separating factor than the actual values on the axes).
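As a rough illustration of this idea, the slope of each line segment can be computed from the normalized axis values and appended to a record as additional decision variables. The Python sketch below is purely hypothetical: it is not an implemented or evaluated method from this post, it assumes min-max scaled axes that are drawn one unit apart, and all names and numbers are made up.

```python
def segment_slopes(record, minima, maxima):
    """Return one slope per pair of neighbouring axes for a single record."""
    scaled = []
    for value, lo, hi in zip(record, minima, maxima):
        scaled.append(0.5 if hi == lo else (value - lo) / (hi - lo))
    # Neighbouring axes are assumed to be one unit apart, so rise equals slope.
    return [right - left for left, right in zip(scaled, scaled[1:])]


def with_slope_features(record, minima, maxima):
    """Append the segment slopes to the original attribute values."""
    return list(record) + segment_slopes(record, minima, maxima)


# Made-up record with three attributes and made-up axis ranges.
minima, maxima = [0.0, 10.0, -1.0], [4.0, 20.0, 1.0]
print(with_slope_features([1.0, 20.0, 0.0], minima, maxima))
# [1.0, 20.0, 0.0, 0.75, -0.5]
```

Note that the slopes depend on the chosen axis order and on inverted axes, so the same axis configuration would have to be used for training and for prediction.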

Based on our visual impressions so far, we can develop a trivial classifier. The resulting classifier is described in the images below:

## Discussion and Conclusion

In this blog post, we presented ||-Coordinates, a visualization technique for multidimensional and multivariate data invented by Alfred Inselberg. We described the fundamental properties of ||-Coordinates and applied them to various datasets, differing in the number of observations, the dimensionality and the attribute characteristics, in order to examine their features and potential issues.

Even though designing a (visual) classifier based on parallel coordinates does not seem to be a simple task, future work could investigate to what extent it is possible to use the geometric features of ||-Coordinates (such as slopes and proximities) to design or support classification algorithms.

The wine-quality dataset contained observations of around 6500 Portuguese red and white wines. In total, 11 objective attributes such as alcohol, pH and density describe the individual wines; a 12th attribute, the wine quality, was assessed by wine tasters.

As an introductory example, we applied ||-Coordinates to the Pollen dataset in order to find the information hidden in it. With ordinary ||-Coordinates, it was not possible to gain many insights into the data, because the high number of observations caused overplotting. Only after alpha blending was added to the plot was it possible to identify the cluster. Visualizing this cluster in a scatter plot finally unveiled the hidden secret.
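Why alpha blending makes dense clusters stand out can be seen from a short calculation. The Python sketch below assumes standard "over" compositing of equally colored, equally transparent lines, which is how many graphics devices treat transparency; whether the plotting device used for the figures in this post behaves exactly like this is an assumption, and the alpha value of 0.02 is made up for the example.

```python
def stacked_opacity(alpha, n_lines):
    """Opacity of a pixel covered by n_lines equally transparent lines."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Each additional line lets through a fraction (1 - alpha) of what is below.
    return 1.0 - (1.0 - alpha) ** n_lines


for n in (1, 10, 100):
    print(n, round(stacked_opacity(0.02, n), 3))
```

A single line is almost invisible with such a small alpha, whereas a region crossed by many lines becomes clearly visible, so the resulting darkness acts as a rough density estimate.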

The second dataset we analyzed was the Out5d dataset. The number of dimensions of this dataset is rather small and can easily be handled by ||-Coordinates, but the number of observations (16,384 in total) cannot. Again, alpha blending helped us to counter the overplotting problem and reduce the clutter. As seen in this example, the ||-Coordinates-Matrix is a suitable method when searching for patterns or clusters in the data and can provide more interesting and useful views.

The main task for the wine-quality dataset, which we used as the third example, was to find the relation between the 11 objective attributes and the wine quality. We found that, next to alpha blending and axis reordering, axis inversion can be helpful in many cases in order to reduce clutter, especially when negative correlations are present. Using a heat color scheme based on the wine quality made it possible to examine the influence of certain attributes on the quality of the wine.

The last dataset we explored, with the highest number of observations (more than 130,000) and dimensions (51 in total, including the class information) among all datasets, was the MiniBooNe particle identification dataset, which is mainly used for classification purposes. It was not possible to display the whole dataset with ||-Coordinates; we had to limit ourselves to 20,000 samples. Finding relations and patterns among all attributes was also rather difficult. Hence, we reduced the number of attributes by selecting those 13 which appeared to have the best (visual) class-separating properties. After a couple of axis reconfigurations, we found a display that could clearly separate the two classes of the dataset.

Overall, we found ||-Coordinates to be a suitable and helpful tool in our data-mining tasks, if used in an appropriate way and adapted to the specific dataset. One main advantage we see is the comprehensibility: even though ||-Coordinates have a complex underlying theory, they can be easily applied and read, even by inexperienced users. Overplotting in particular was an issue we faced when applying ||-Coordinates to our fairly large datasets, because it causes polygonal lines to cover important patterns. We addressed this problem by using a density-based technique, the so-called alpha blending, which introduces a certain degree of transparency and can thus make clusters and other patterns visible. Our impression was that ||-Coordinates show relations in the data (e.g., linear correlations or normally distributed attributes, indicated by hyperbolic envelopes) more clearly than heatmaps and dimensional stacking. ||-Coordinates were able to reveal ordered structures in data, e.g., in the Pollen dataset, which contained mostly random noise. However, in a few cases we wrongly assumed negative correlations where actually no specific correlation was present. We could often disprove the wrong assumption by inverting the axis concerned. Inverting (flipping) an individual axis mirrors the values on that axis, so that a negative correlation with its many line crossings appears as a positive one with roughly parallel lines in the adjacent segments. This can also help to reduce clutter and give a better view of the data. However, similar to dimensional stacking, the axis configuration is an important factor for revealing structures (in dimensional stacking, the axis ordering affects the appearance of the plot even more). As one approach to address this problem, we introduced the ||-Coordinates-Matrix, which makes it possible to analyze all pairwise relations between the attributes with a fairly small number of ||-Coordinates plots.
After a suitable axis configuration was found, ||-Coordinates proved to be a strong tool for revealing clusters in more than one dimension, by using the proximity, slope and density of the polygonal lines. Heatmaps and dimensional stacking did not appear to be as suitable for clustering purposes. In contrast to heatmaps, where the color is already a main part of the value representation, we can use colors in ||-Coordinates to add information to the plot (e.g., the class information). Displaying large datasets with our ||-Coordinates implementation was only possible to a certain degree. We had to sample the MiniBooNe dataset, whereas dimensional stacking, for example, could use all observations (but is, on the other hand, restricted to only 10 dimensions). Nevertheless, by evaluating diagrams for different sample sizes, we could show that the main structure in ||-Coordinates was preserved; in this respect, ||-Coordinates seems to be a rather robust technique. The handling of outliers was one additional issue we found with ||-Coordinates (and with heatmaps and dimensional stacking as well): on the one hand, outliers can be easily spotted with ||-Coordinates, but on the other hand, they distort the whole view of the plot, because the majority of the observations is then compressed into a small region.

Even though we encountered a few problems in the beginning – for example, overplotting as one main problem – we could obtain many interesting results from the analyzed datasets using ||-Coordinates, which we unfortunately could not all place in this blog post. Nevertheless, it was astonishing to see how a comparatively simple technique such as ||-Coordinates – without any noteworthy extensions or enhancements – is able to reveal the patterns, relations and other information hidden in raw data.

## References

1. A. Inselberg, “The plane with parallel coordinates,” The Visual Computer, vol. 1, no. 2, pp. 69–91, 1985.
2. J. LeBlanc, M. O. Ward, and N. Wittels, “Exploring N-Dimensional Databases,” in Visualization, 1990, Los Alamitos, CA, 1990, pp. 230–237.
3. H. Chernoff, “The Use of Faces to Represent Points in K-Dimensional Space Graphically,” Journal of the American Statistical Association, vol. 68, no. 342, pp. 361–368, 1973.
4. A. Inselberg, “Parallel Coordinates: Visual Multidimensional Geometry and Its Applications,” in Knowledge Discovery and Information Retrieval, Barcelona, 2012.
5. A. Inselberg, “Multidimensional detective,” in IEEE Visualization 1997, Los Alamitos, CA, 1997, pp. 100–107.
6. D. Plemenos and G. Miaoulis, Visual Complexity and Intelligent Computer Graphics Techniques Enhancements, vol. 200. Berlin, Heidelberg: Springer, 2009, pp. 123–141.
7. J. Heinrich and D. Weiskopf, “State of the Art of Parallel Coordinates,” in STAR Proceedings of Eurographics 2013, 2013, pp. 95–116.
8. A. Inselberg and B. Dimsdale, “Parallel Coordinates: A Tool for Visualizing Multi-dimensional Geometry,” in IEEE Visualization 1990, Los Alamitos, CA, 1990, pp. 361–378.
9. Y. Xu, W. Hong, N. Chen, X. Li, W. Liu, and T. Zhang, “Parallel Filter: A Visual Classifier Based on Parallel Coordinates and Multivariate Data Analysis,” Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, pp. 1172–1183, 2007.
10. C. A. Steed, J. E. Swan II, P. J. Fitzpatrick, and T. J. Jankun-Kelly, “A Visual Analytics Approach for Correlation, Classification, and Regression Analysis,” Oak Ridge National Laboratory, Oak Ridge, TN, Technical Report ORNL/TM-2012/68, 2012.
11. G. Ellis and A. J. Dix, “Enabling Automatic Clutter Reduction in Parallel Coordinate Plots,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 5, pp. 717–724, Jan. 2007.
12. J. Heinrich, J. Stasko, and D. Weiskopf, “The Parallel Coordinates Matrix,” in EuroVis, Vienna, 2012, pp. 37–41.
13. A. Tatu et al., “Combining automated analysis and visualization techniques for effective exploration of high-dimensional data,” in IEEE Visual Analytics Science and Technology, Piscataway, NJ, 2009, pp. 59–66.
14. X. Yuan, P. Guo, H. Xiao, H. Zhou, and H. Qu, “Scattering Points in Parallel Coordinates,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1001–1008, 2009.
15. E. Bertini, L. Dell’Aquila, and G. Santucci, “SpringView: cooperation of radviz and parallel coordinates for view optimization and clutter reduction,” in Coordinated and Multiple Views in Exploratory Visualization, 2005, Los Alamitos, CA, 2005, pp. 22–29.
16. O. Rübel et al., “PointCloudXplore: Visual Analysis of 3D Gene Expression Data Using Physical Views and Parallel Coordinates,” in EuroVis, Vienna, 2009, pp. 203–210.
17. M. Streit et al., “3D parallel coordinate systems - A new data visualization method in the context of microscopy-based multicolor tissue cytometry,” Cytometry Part A, vol. 69A, no. 7, pp. 601–611, 2006.
18. D. T. Nhon, L. Wilkinson, and A. Anand, “Stacking Graphic Elements to Avoid Over-Plotting,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1044–1052, 2010.
19. J. Heinrich and D. Weiskopf, “Continuous Parallel Coordinates,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1531–1538, 2009.
20. R Core Team, “R: A Language and Environment for Statistical Computing.” R Foundation for Statistical Computing, Vienna, Austria, 2016.
21. D. Coleman, “Pollen Data-Set.” RCA Labs, Princeton, USA, 1986.
22. P. Ketelaar, “The OUT5D dataset.” Curtin University of Technology, Perth, Australia, 2005.
23. W. A. Malik and A. Ünlü, “Interactive Graphics: Exemplified with Real Data Applications,” Frontiers in Psychology, vol. 2, no. 11, 2011.
24. H. Makwana, S. Tanwani, and S. Jain, “Axes Re-Ordering in Parallel Coordinate for Pattern Optimization,” International Journal of Computer Applications, vol. 40, no. 13, pp. 43–48, 2012.
25. H. Zhou, W. Cui, H. Qu, Y. Wu, X. Yuan, and W. Zhuo, “Splatting the Lines in Parallel Coordinates,” Computer Graphics Forum, vol. 28, no. 3, pp. 759–766, Sep. 2009.
26. C. Yu, D. Yurovsky, and T. L. Xu, “Visual Data Mining: An Exploratory Approach to Analyzing Temporal Patterns of Eye Movements,” Infancy, vol. 17, no. 1, pp. 33–60, 2012.
27. G. Palmas, M. Bachynskyi, A. Oulasvirta, H.-P. Seidel, and T. Weinkauf, “An Edge-Bundling Layout for Interactive Parallel Coordinates,” in IEEE PacificVis, Los Alamitos, CA, 2014, pp. 44–54.
28. J. Johansson and M. D. Cooper, “A Screen Space Quality Method for Data Abstraction,” Computer Graphics Forum, vol. 27, no. 3, pp. 1039–1046, 2008.
29. A. O. Artero, M. C. F. de Oliveira, and H. Levkowitz, “Uncovering Clusters in Crowded Parallel Coordinates Visualizations,” in IEEE Visualization 2004, Los Alamitos, CA, 2004, pp. 81–88.
30. E. Bertini, L. Dell’Aquila, and G. Santucci, “Reducing cluttering through non uniform sampling, displacement, and user perception,” in Visualization and Data Analysis 2006, San Jose, CA, 2006, vol. 6060, pp. 60600L–60600L-12.
31. K. Bache and M. Lichman, “UCI Machine Learning Repository,” 2013.
32. P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modeling wine preferences by data mining from physicochemical properties,” Decision Support Systems, vol. 47, no. 4, pp. 547–553, 2009.
33. A. Nachev and B. Stoyanov, “Product Quality Analysis using Support Vector Machines,” Information Models and Analyses, vol. 1, no. 2, pp. 170–193, 2012.
34. A. Lambrou, H. Papadopoulos, I. Nouretdinov, and A. Gammerman, “Reliable Probability Estimates Based on Support Vector Machines for Large Multiclass Datasets,” in Artificial Intelligence Applications and Innovations, vol. 382, L. Iliadis and others, Eds. Springer, Berlin, Heidelberg, 2012, pp. 182–191.
35. A. Dasgupta and R. Kosara, “Pargnostics: Screen-Space Metrics for Parallel Coordinates,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1017–1026, 2010.
36. B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, “Boosted decision trees as an alternative to artificial neural networks for particle identification,” Nuclear Instruments and Methods in Physics Research, vol. A543, pp. 577–584, 2005.

## Appendix

### The Wine-Quality dataset

The attributes are summarized in the following table:

| Attribute | Unit | Shortcut |
|---|---|---|
| Fixed acidity | g(tartaric acid)/dm³ | `f_acid` |
| Volatile acidity | g(acetic acid)/dm³ | `v_acid` |
| Citric acid | g/dm³ | `c_acid` |
| Residual sugar | g/dm³ | `sugar` |
| Chlorides | g(sodium chloride)/dm³ | `chlorides` |
| Free sulfur dioxide | mg/dm³ | `f_sulfur` |
| Total sulfur dioxide | mg/dm³ | `t_sulfur` |
| Density | g/cm³ | `density` |
| pH | -- | `pH` |
| Sulphates | g(potassium sulphate)/dm³ | `sulphates` |
| Alcohol | vol. % | `alcohol` |

### Markus Thill

I studied computer engineering (B.Sc.) and Automation & IT (M.Eng.). Generally, I am interested in machine learning (ML) approaches (in the broadest sense), but particularly in the fields of time series analysis, anomaly detection, Reinforcement Learning (e.g. for board games), Deep Learning (DL) and incremental (on-line) learning procedures.
