Title: | Permutation Distribution Clustering |
---|---|
Description: | Permutation Distribution Clustering is a clustering method for time series. Dissimilarity of time series is formalized as the divergence between their permutation distributions. The permutation distribution was proposed as measure of the complexity of a time series. |
Authors: | Andreas M. Brandmaier [aut, cre] |
Maintainer: | Andreas M. Brandmaier <[email protected]> |
License: | GPL-3 |
Version: | 1.1.0 |
Built: | 2024-10-31 22:10:11 UTC |
Source: | https://github.com/brandmaier/pdc |
Permutation Distribution Clustering (pdc) represents a complexity-based approach to clustering time series. Clustering comprises methods that recover similarities in a dataset and represent the findings in group structures. Important applications of clustering include the creation of taxonomies, the discovery of anomalies, or the the discovery of reliably different subgroups for differential analysis or treatment. A crucial parameter in clustering is the choice of the similarity measure between objects. Permutation Distribution Clustering finds similarity in time series based on differences in their permutation distribution as a proxy for differences in their complexity. The permutation distribution is obtained by counting the frequency of distinct order patterns in an m-embedding of the original time series. An embedding of dimension m allows for m! different order patterns. The choice of the embedding dimension crucially influences the clustering result. A small embedding dimension might lead to a permutation distribution with a low representational power, while a large embedding dimension quickly leads to a large permutation distribution that cannot reliably be estimated. With the Minimum Entropy Heuristic (MinE), the embedding dimension can automatically be chosen, thus making the algorithm a parameter-free clustering approach. For clustering time-series, the similarity between two time-series is defined as the divergence between two permutation distributions. PDC is particularly apt for the analysis of psychophysiological time-series because it is efficient (the time complexity is linear in the time-series length), it is robust to drift, time-series of differing length can be compared, and it is invariant to differences in mean and variance of the time-series (choosing a normalization is not essential).
The main function of the package is pdclust
, which performs a hierarchical
clustering of a set of time series based on differences in their permutation distributions.
Other clustering or dimensionality-reduction methods can easily be employed by
directly accessing the distance matrix based on the permutation distribution via pdcDist
.
A heuristic for choosing the embedding dimension is provided via entropyHeuristic
.
For clustering shapes, shape signatures can be traced from images with traceImage
. Example data sets for shape clustering are star.shapes
and complex.shapes
.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
# generate 5 ARMA time series for the first group grp1 <- replicate(5, arima.sim(n = 500, list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)), sd = sqrt(0.1796)) ) # generate 5 ARMA time series for the second group grp2 <- replicate(5, arima.sim(n = 500, list(ar = c(-0.71, 0.18), ma = c(0.92, 0.14)), sd = sqrt(0.291)) ) # combine groups into a single dataset X <- cbind(grp1,grp2) # run clustering and color original groups each in red and blue clustering <- pdclust(X,3) plot(clustering, cols=c(rep("red",5),rep("blue",5)))
# generate 5 ARMA time series for the first group grp1 <- replicate(5, arima.sim(n = 500, list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)), sd = sqrt(0.1796)) ) # generate 5 ARMA time series for the second group grp2 <- replicate(5, arima.sim(n = 500, list(ar = c(-0.71, 0.18), ma = c(0.92, 0.14)), sd = sqrt(0.291)) ) # combine groups into a single dataset X <- cbind(grp1,grp2) # run clustering and color original groups each in red and blue clustering <- pdclust(X,3) plot(clustering, cols=c(rep("red",5),rep("blue",5)))
A codebook contains the permutation distribution of a time series.
codebook(x, m = 3, t = 1, use.fast=TRUE, normalized = TRUE, codeword_func = NULL)
codebook(x, m = 3, t = 1, use.fast=TRUE, normalized = TRUE, codeword_func = NULL)
x |
a vector ot a time series |
m |
The embedding dimension. |
t |
Time-delay of the embedding. |
use.fast |
Use a fast C-implementation if possible. |
normalized |
Normalize codebook such that it is a probability distribution. |
codeword_func |
Function to compute codewords. If NULL, the default internal function codebook is used. |
The length of a codebook is the factorial of the embedding dimension. The elements of the codebook represent relative frequencies of codewords of size m.
Returns a vector of relative frequencies.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
# calculate codebook from sine-wave cb <- codebook(c(sin(1:100)),m=3) # plot the permutation distribution barplot(cb,xlab="Permutation Distribution")
# calculate codebook from sine-wave cb <- codebook(c(sin(1:100)),m=3) # plot the permutation distribution barplot(cb,xlab="Permutation Distribution")
This data set provides exemplary shape signatures of each five fish, bottles, and pairs of glasses. The planar shapes were derived from original artwork obtained from openclipart.org. Column names encode the type of image.
data(complex.shapes) data(complex.shapes.raw)
data(complex.shapes) data(complex.shapes.raw)
complex.shapes
is a matrix containing 100 rows and 15 columns obtained from trace.image
.
complex.shapes.raw
contains a list containing 15 three-dimensional matrices and an array of corresponding labels.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
Functions to calculate distances/dissimilarities between codebooks.
hellingerDistance(x,y) squaredHellingerDistance(x,y) symmetricAlphaDivergence(x,y)
hellingerDistance(x,y) squaredHellingerDistance(x,y) symmetricAlphaDivergence(x,y)
x |
a codebook |
y |
a codebook |
Note: The symmetric alpha-divergence is proportional to the Squared Hellinger distance, and is the default divergence between codebooks.
Returns a numeric dissimilarity between two codebooks.
Andreas Brandmaier [email protected]
x <- codebook(c(sin(1:100)),m=3) y <- codebook(c(sin(1:100*0.1)),m=3) hellingerDistance(x,y)
x <- codebook(c(sin(1:100)),m=3) y <- codebook(c(sin(1:100*0.1)),m=3) hellingerDistance(x,y)
The information content of a permutation distribution depends crucially on the choice of the embedding dimension. Too small embedding dimensions narrow the representational power of the distribution, too large embedding dimensions dilute the estimation of the distribution. The Minimum Entropy Heuristic (MinE) automatically chooses an embedding dimension with an optimal representational entropy as proxy for representational power.
entropyHeuristic(X, m.min=3, m.max=7, t.min = 1, t.max = 1, normalize_by_observed = TRUE) ## S3 method for class 'mine' print(x, ...) ## S3 method for class 'mine' summary(object, ...) ## S3 method for class 'mine' plot(x, normalize = FALSE, type = "image", mark.optimum = TRUE, col = heat.colors(12), ...)
entropyHeuristic(X, m.min=3, m.max=7, t.min = 1, t.max = 1, normalize_by_observed = TRUE) ## S3 method for class 'mine' print(x, ...) ## S3 method for class 'mine' summary(object, ...) ## S3 method for class 'mine' plot(x, normalize = FALSE, type = "image", mark.optimum = TRUE, col = heat.colors(12), ...)
X |
A matrix representing a set of time series. Columns are time series and rows represent time points. |
x |
An entropy heuristic object of type |
object |
An entropy heuristic object of type |
m.min |
Minimum embedding dimension |
m.max |
Maximum embedding dimension |
t.min |
Minimum time-delay |
t.max |
Maximum time-delay |
normalize_by_observed |
Boolean. If false, entropy is normalized by dividing by the logarithm of the number of all possible patterns. If true, it is divided by the logarithm of the number of observed patterns. |
... |
Further arguments for the generic print, summary, and plot method. |
normalize |
Normalize values to range [0;1]. |
type |
Either 'image' or 'contour'. Specifies the plot type. |
mark.optimum |
Mark the optimal embedding dimension and/or time-delay. |
col |
A color map to represent entropy values on. |
For a range of embedding dimensions, the average entropy of the dataset is calculated. The embedding dimension with
the lowest entropy is chosen. print
and plot
is available for result objects.
The plot of a heuristic object shows the entropy values depending on a range of embedding dimensions and time-delays. If only embedding dimension or only time-delay is varied, a line plot is show to indicate the parameter yielding minimum entropy. Otherwise, an image plot is shown that indicates minimum entropy depending on both parameters.
A list is returned with the following elements:
m |
The chosen embedding size. |
entropy.values |
A vector with average entropy values corresponding to each entry in |
entropy.ms |
A vector of the embedding dimensions that were searched for the optimal embedding. |
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
Brandmaier, A. M. (2012). Permutation Distribution Clustering and Structural Equation Model Trees. Doctoral dissertation. Saarland University, Saarbruecken, Germany.
# (1) # # create a sine-wave with added noise # and display a plot showing the average permutation entropy # depending on varying choices of the embedding size # (by default time-delay is not searched over) heuristic <- entropyHeuristic( sin(1:100)+rnorm(100,0,1) ) plot(heuristic) # (2) # # calculate both optimal embedding dimension and time-delay # heuristic <- entropyHeuristic( sin(1:100)+rnorm(100,0,1), t.min=1, t.max=6 ) plot(heuristic)
# (1) # # create a sine-wave with added noise # and display a plot showing the average permutation entropy # depending on varying choices of the embedding size # (by default time-delay is not searched over) heuristic <- entropyHeuristic( sin(1:100)+rnorm(100,0,1) ) plot(heuristic) # (2) # # calculate both optimal embedding dimension and time-delay # heuristic <- entropyHeuristic( sin(1:100)+rnorm(100,0,1), t.min=1, t.max=6 ) plot(heuristic)
Evaluates a clustering distance matrix within a supervised learning scheme: leave-one-out one-neares-neighbor cross-validation. This yields a rough estimate of the suitabilty of a distance function for discriminating between classes if ground-truth is known.
loo1nn(x, y)
loo1nn(x, y)
x |
A |
y |
A vector of the true class labels. |
Returns a percentage-correct estimate.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
Plots a two-dimensional projection to the principal coordinates of all observations. Clusters are shown as polygonal convex hulls of their members.
mdsPlot (X, labels = NULL, col = "gray")
mdsPlot (X, labels = NULL, col = "gray")
X |
A |
labels |
Optional. A vector of labels for the observations. If NULL, column names of the dataset are used. |
col |
A vector of colors for polygon shading. |
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
data("complex.shapes") truth <- c(rep("fish",5),rep("bottle",4),rep("glasses",5)) clust <- pdclust(complex.shapes, t=5) mdsPlot(clust, truth, col=c("lightblue","lightgreen","lightgray"))
data("complex.shapes") truth <- c(rep("fish",5),rep("bottle",4),rep("glasses",5)) clust <- pdclust(complex.shapes, t=5) mdsPlot(clust, truth, col=c("lightblue","lightgreen","lightgray"))
This function computes and returns the distance matrix computed by the divergence between permutation distributions of time series.
pdcDist(X, m = NULL, t = NULL, divergence = symmetricAlphaDivergence)
pdcDist(X, m = NULL, t = NULL, divergence = symmetricAlphaDivergence)
X |
A matrix representing a set of time series. Columns are time series and rows represent time points. |
m |
Embedding dimension for calculating the permutation distributions. Reasonable values range usually somewhere between 2 and 10. If no embedding dimension is chosen, the MinE heuristic is used to determine the embedding dimension automatically. |
t |
Time-delay of the embedding |
divergence |
Divergence measure between discrete distributions. Default is the symmetric alpha divergence. |
A valid divergence is always non-negative.
Returns the dissimilarity between two codebooks as floating point number (larger or equal than zero).
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
# create a set of time series consisting # of sine waves with different degrees of added noise # and two white noise time series X <- cbind( sin(1:500)+rnorm(500,0,.1), sin(1:500)+rnorm(500,0,.2), sin(1:500)+rnorm(500,0,.3), sin(1:500)+rnorm(500,0,.4), rnorm(500,0,1), rnorm(500,0,1) ) # calculate the distance matrix D <- pdcDist(X,3) # and plot with lattice package, you will # be able to spot two clusters: a noise cluster # and a sine wave cluster require("lattice") levelplot(as.matrix(D), col.regions=grey.colors(100,start=0.9, end=0.3))
# create a set of time series consisting # of sine waves with different degrees of added noise # and two white noise time series X <- cbind( sin(1:500)+rnorm(500,0,.1), sin(1:500)+rnorm(500,0,.2), sin(1:500)+rnorm(500,0,.3), sin(1:500)+rnorm(500,0,.4), rnorm(500,0,1), rnorm(500,0,1) ) # calculate the distance matrix D <- pdcDist(X,3) # and plot with lattice package, you will # be able to spot two clusters: a noise cluster # and a sine wave cluster require("lattice") levelplot(as.matrix(D), col.regions=grey.colors(100,start=0.9, end=0.3))
Hierarchical cluster analysis for time series. Similarity of time series is based on the similarity of their permutation distributions.
pdclust(X, m = NULL, t = NULL, divergence = symmetricAlphaDivergence, clustering.method = "complete") ## S3 method for class 'pdclust' plot(x, labels=NULL, type="rectangle", cols="black", timeseries.as.labels = T, p.values=F, ...) ## S3 method for class 'pdclust' str(object, ...) ## S3 method for class 'pdclust' print(x, ...)
pdclust(X, m = NULL, t = NULL, divergence = symmetricAlphaDivergence, clustering.method = "complete") ## S3 method for class 'pdclust' plot(x, labels=NULL, type="rectangle", cols="black", timeseries.as.labels = T, p.values=F, ...) ## S3 method for class 'pdclust' str(object, ...) ## S3 method for class 'pdclust' print(x, ...)
X |
In the univariate case: A matrix representing a set of time series. Columns represent different time series and rows represent time. In the multivariate case: A three-dimensional matrix with the first dimension representing time, second dimension representing multivariate time series, and the third dimension representing variables. |
m |
Embedding dimension for calculating the permutation distributions. Reasonable values range somewhere between 2 and 10. If no embedding dimension is chosen, the MinE heuristic is used to determine the embedding dimension automatically. |
t |
Time-delay of the embedding. |
divergence |
Divergence measure between discrete distributions. Default is the symmetric alpha divergence. |
clustering.method |
Hierarchical clustering linkage method. One out of c("complete","average","single"). |
For plotting:
x |
A |
labels |
Optionally provide a vector of labels for the time series here. |
type |
One of c("triangle","rectangle") to choose the dendrogram style. |
cols |
Specify line color either as string or as vector of strings |
timeseries.as.labels |
If |
p.values |
Annotation of the cluster hierarchy with p values |
... |
Further graphical arguments. |
For string representation:
object |
A |
The function pdclust
is the central function for clustering time-series in the package pdc
.
It allows clustering of univariate and multivariate time-series.
If time-series have different length, the shorter time-series can be padded
with NA
s to bring them to columns of the same length in an array or a
matrix.
Multivariate time-series can also be handled by pdclust
. Therefore,
the data must be transformed into a three-dimensional matrix with the
dimenions representing (1) time, (2) entities, and (3) variables/channels.
Calls to pdclust
return a pdclust
object. There are
print
, str
and plot
methods for pdclust
objects.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering
of Time Series. Journal of Statistical Software, 67(5), 1–23.
Brandmaier, A. M. (2012). Permutation Distribution Clustering and Structural Equation Model Trees. Doctoral dissertation. Saarland University, Saarbruecken, Germany.
pdcDist
entropyHeuristic
symmetricAlphaDivergence
# generate 5 ARMA time series for the first group grp1 <- replicate(5, arima.sim(n = 500, list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)), sd = sqrt(0.1796)) ) # generate 5 ARMA time series for the second group grp2 <- replicate(5, arima.sim(n = 500, list(ar = c(-0.71, 0.18), ma = c(0.92, 0.14)), sd = sqrt(0.291)) ) # combine groups into a single dataset X <- cbind(grp1,grp2) # run clustering and color original groups each in red and blue clustering <- pdclust(X) plot(clustering, cols=c(rep("red",5),rep("blue",5)))
# generate 5 ARMA time series for the first group grp1 <- replicate(5, arima.sim(n = 500, list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)), sd = sqrt(0.1796)) ) # generate 5 ARMA time series for the second group grp2 <- replicate(5, arima.sim(n = 500, list(ar = c(-0.71, 0.18), ma = c(0.92, 0.14)), sd = sqrt(0.291)) ) # combine groups into a single dataset X <- cbind(grp1,grp2) # run clustering and color original groups each in red and blue clustering <- pdclust(X) plot(clustering, cols=c(rep("red",5),rep("blue",5)))
Plots a horizontal dendrogram with images as leafs
rasterPlot(cl, raw, monochrome=FALSE, aspect, ...)
rasterPlot(cl, raw, monochrome=FALSE, aspect, ...)
cl |
A |
raw |
List of RGB images in matrix format. |
monochrome |
Convert image to b/w representation. |
aspect |
Aspect ratio. |
... |
Further graphical arguments. |
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
data("complex.shapes") data("complex.shapes.raw") clust <- pdclust(complex.shapes, t=5) rasterPlot(clust, complex.shapes.raw$images)
data("complex.shapes") data("complex.shapes.raw") clust <- pdclust(complex.shapes, t=5) rasterPlot(clust, complex.shapes.raw$images)
This data set provides exemplary shape signatures of Column names are coded 'X-Y-Z' with X indicating the number of points (4 = 4-point-star or 5 = 5-point-star), Y indicating the size (0=100%, 1=150%), Z indicating rotation (0=0 degree, 1=45 degree).
data(star.shapes) data(star.shapes.raw)
data(star.shapes) data(star.shapes.raw)
star.shapes
is a matrix containing 100 rows and 8 columns obtained from traceImage
.
star.shapes.raw
contains a list of three-dimensional matrices containing the raw images and an array of corresponding labels.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
Trace the shape of an image to create a shape signature
traceImage(img, resolution) convertImage(imgrgb, threshold = 0.5)
traceImage(img, resolution) convertImage(imgrgb, threshold = 0.5)
img |
a matrix of size width x height representing a black/white image with 0=white and 1=black. |
resolution |
The angular resolution of the trace, i.e., the length of the resulting shape signature. |
imgrgb |
a matrix of size width x height x channels representing an RGB image. |
threshold |
The average intensity value that serves as boundary to separate black and white pixels. |
Shape signatures of objects can be created by unrolling their contour around its centroid across time. The resulting time series represents distance-to-center of points on the contour versus radial angle.
In order to create signatures for RGB images, convert the image with convert.image
to a black-and-white image using a threshold
between 0 and 1.
Exemplary datasets containing shape signatures for shape clustering are provided in this package as star.shapes
and complex.shapes
.
Returns a list containing angle
s and corresponding distance
s from center.
Andreas Brandmaier [email protected]
Brandmaier, A. M. (2015). pdc: An R Package for Complexity-Based Clustering of Time Series. Journal of Statistical Software, 67(5), 1–23.
# create a filled rectangle in a 20x20 image img <- matrix(0, nrow=20,ncol=20) img[5:15,5:15] <- 1 # create shape signature signature <- traceImage(img) # plot both original image and shape signature par(mfrow=c(1,3)) #layout(matrix(c(1,2,2), 1, 3, byrow = TRUE)) image(img) plot(signature$angle, signature$distance,type="l",xlab="angle",ylab="distance") # reconstruct radial plot require("plotrix") radial.plot(traceImage(img,resolution=500)$distance,start=0,rp.type="r",radial.lim=c(0,10))
# create a filled rectangle in a 20x20 image img <- matrix(0, nrow=20,ncol=20) img[5:15,5:15] <- 1 # create shape signature signature <- traceImage(img) # plot both original image and shape signature par(mfrow=c(1,3)) #layout(matrix(c(1,2,2), 1, 3, byrow = TRUE)) image(img) plot(signature$angle, signature$distance,type="l",xlab="angle",ylab="distance") # reconstruct radial plot require("plotrix") radial.plot(traceImage(img,resolution=500)$distance,start=0,rp.type="r",radial.lim=c(0,10))