Title: | Clustering and Feature Screening using L1 Fusion Penalty |
---|---|
Description: | Provides the Big Merge Tracker and COSCI algorithms for convex clustering and feature screening using L1 fusion penalty. |
Authors: | Trambak Banerjee [aut, cre], Gourab Mukherjee [aut], Peter Radchenko [aut] |
Maintainer: | Trambak Banerjee <[email protected]> |
License: | GPL(>=2) |
Version: | 1.0.0 |
Built: | 2024-11-09 03:21:36 UTC |
Source: | https://github.com/trambakbanerjee/fusionclust |
Solves an L1 relaxed univariate clustering criterion and returns a
sequence of values where the clusters merge
bmt(x, alpha = 0.1, small.perturbation = 10^(-6))
x | observation vector |
alpha | merging threshold. Default is 0.1 |
small.perturbation | a small positive number to remove ties. Default is 10^(-6) |
Solves a convex relaxation of the univariate clustering criterion given by equation
(2) in the referenced paper and generates a sequence of cluster merges and the corresponding
lambda values. See Algorithm 1 in the referenced paper for more details.
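For orientation, a generic univariate L1 fusion criterion is sketched below. This is the standard form from the convex clustering literature and is an assumption here: the exact scaling and notation of equation (2) in the referenced paper may differ. The symbols \mu_i (fitted value for observation x_i) and \lambda (penalty weight) are illustrative.

```latex
\min_{\mu_1,\dots,\mu_n} \; \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu_i)^2
  \;+\; \lambda \sum_{i<j} \lvert \mu_i - \mu_j \rvert
```

Under a criterion of this type, fitted values fuse as \lambda grows, which is the sense in which clusters merge along the path.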
path - number of clusters on the big merge path
lambda.path - sequence of lambda where clusters merge
index - cluster index at the point where clusters merge
merge - merge points
split - split points
prob - merging proportion
boundaries - cluster boundaries
P. Radchenko, G. Mukherjee, Convex clustering via l1 fusion penalization, J. Roy. Statist. Soc. Ser. B (Statistical Methodology) (2017) doi:10.1111/rssb.12226.
library(fusionclust)
set.seed(42)
x <- c(rnorm(1000, -2, 1), rnorm(1000, 2, 1))
out <- bmt(x)
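A minimal sketch of how the result of bmt might be inspected. It assumes only that bmt returns an R object that str() can display; the call is guarded so the snippet also runs where fusionclust is not installed.

```r
# Simulate the same two-component mixture used in the example.
set.seed(42)
x <- c(rnorm(1000, -2, 1), rnorm(1000, 2, 1))

# Guard: only call bmt() when the package is available.
if (requireNamespace("fusionclust", quietly = TRUE)) {
  out <- fusionclust::bmt(x)
  # str() shows whichever components are actually present,
  # without relying on their names or types.
  str(out)
}
```

Using str() avoids assuming how the documented values (path, lambda.path, index, merge, split, prob, boundaries) are stored in the returned object.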
Ranks the p features in an n by p design matrix where n represents the sample size and p is the number of features.
cosci_is(dat, min.alpha, small.perturbation = 10^(-6))
dat | n by p data matrix |
min.alpha | the smallest threshold (typically set to 0) |
small.perturbation | a small positive number to remove ties. Default value is 10^(-6) |
Uses the univariate merging algorithm bmt and produces a score for each feature that reflects its relative importance for clustering.
a p vector of scores
Banerjee, T., Mukherjee, G. and Radchenko P., Feature Screening in Large Scale Cluster Analysis, Journal of Multivariate Analysis, Volume 161, 2017, Pages 191-212
P. Radchenko, G. Mukherjee, Convex clustering via l1 fusion penalization, J. Roy. Statist. Soc. Ser. B (Statistical Methodology) (2017) doi:10.1111/rssb.12226.
library(fusionclust)
set.seed(42)
noise <- matrix(rnorm(49000), nrow = 1000, ncol = 49)
set.seed(42)
signal <- c(rnorm(500, -1.5, 1), rnorm(500, 1.5, 1))
x <- cbind(signal, noise)
scores <- cosci_is(x, 0)
Once you have the feature scores from cosci_is, you can select the features
1. based on a pre-defined threshold,
2. using table A.10 in the paper[1] to determine an appropriate threshold, or
3. using a data driven approach described in the references to select the features and obtain an implicit threshold value.
cosci_is_select implements option 3.
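Selecting features with a pre-defined threshold needs only base R. The sketch below uses a made-up score vector and an illustrative cutoff; both are assumptions, and the cutoff is not a recommended value. It treats larger scores as indicating more important features.

```r
# Hypothetical scores for five features; in practice these would
# come from cosci_is().
scores <- c(0.42, 0.03, 0.05, 0.31, 0.02)
threshold <- 0.1  # illustrative cutoff only

# Indices of the features whose score reaches the threshold.
selected <- which(scores >= threshold)

# Feature indices ordered from highest to lowest score.
ranking <- order(scores, decreasing = TRUE)

print(selected)
print(ranking)
```

The ranking is useful when the goal is to keep a fixed number of top features rather than to apply a cutoff.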
cosci_is_select(score, gamma)
score | a p vector of scores |
gamma | the proportion of the p features assumed to be noise. If your sample size n is smaller than 100, setting gamma = 0.85 is recommended. Otherwise set gamma = 0.9 |
Converts the problem of screening out features with lower scores into a problem in large scale multiple testing and uses the procedure described in reference [2] to determine the signal features.
a vector of selected features
[1] Banerjee, T., Mukherjee, G. and Radchenko P., Feature Screening in Large Scale Cluster Analysis, Journal of Multivariate Analysis, Volume 161, 2017, Pages 191-212
[2] T. Cai, W. Sun, Optimal screening and discovery of sparse signals with applications to multistage high throughput studies, J. Roy. Statist. Soc. Ser. B (Statistical Methodology) 79, no. 1 (2017) 197-223
library(fusionclust)
set.seed(42)
noise <- matrix(rnorm(49000), nrow = 1000, ncol = 49)
set.seed(42)
signal <- c(rnorm(500, -1.5, 1), rnorm(500, 1.5, 1))
x <- cbind(signal, noise)
scores <- cosci_is(x, 0)
features <- cosci_is_select(scores, 0.9)
Estimates the number of clusters from the bmt run
nclust(bmt_output, prob_threshold = 0.5)
bmt_output | output from the bmt function |
prob_threshold | probability threshold. Default is 0.5. Do not change it unless you know what you are doing. See the referenced paper. |
Estimates the number of clusters as the number of big merges + 1. The probability threshold is an adjustment that renders this estimation process more robust to sampling fluctuations. If the sum of the sample frequencies for the two merging clusters in the last big merge is less than 50 percent, we do not report any merges and thus are left with just 1 cluster. See the referenced paper for more details.
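The rule described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's nclust implementation: estimate_k and its input big_merge_props are hypothetical, with big_merge_props holding, for each big merge, the combined sample proportion of the two merging clusters, ordered so that the last entry is the last big merge.

```r
# Hypothetical helper illustrating the described rule; it is not
# the implementation used by nclust().
estimate_k <- function(big_merge_props, prob_threshold = 0.5) {
  n_big <- length(big_merge_props)
  # No big merges: a single cluster.
  if (n_big == 0) return(1L)
  # If the last big merge covers less than the threshold share of
  # the sample, report no merges at all.
  if (big_merge_props[n_big] < prob_threshold) return(1L)
  # Otherwise: number of big merges + 1.
  n_big + 1L
}

k_two_merges <- estimate_k(c(0.3, 0.95))  # last merge covers 95% of the sample
k_weak_last  <- estimate_k(c(0.3, 0.4))   # last merge falls below the threshold
k_no_merges  <- estimate_k(numeric(0))    # no big merges were found
```

With these inputs the helper returns 3, 1 and 1 respectively, matching the "number of big merges + 1" rule and its fallback to a single cluster.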
The number of clusters
P. Radchenko, G. Mukherjee, Convex clustering via l1 fusion penalization, J. Roy. Statist. Soc. Ser. B (Statistical Methodology) (2017) doi:10.1111/rssb.12226.
library(fusionclust)
set.seed(42)
x <- c(rnorm(1000, -2, 1), rnorm(1000, 2, 1))
out <- bmt(x)
k <- nclust(out)