Install FlashX
This document describes how to install FlashX with its R programming interfaces. FlashX currently provides two R interfaces: FlashR and FlashGraphR. The installation steps have been tested on Ubuntu 14.04 and Ubuntu 16.04.
Library dependency
To install FlashX, users need to install R first. In Ubuntu, users can install R as follows:
$ sudo sh -c "echo \"deb http://cran.rstudio.com/bin/linux/ubuntu xenial/\" >> /etc/apt/sources.list"
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
$ sudo apt-get update
$ sudo apt-get install -y r-base-core
To run FlashX faster and use disks to scale to large datasets, users
need to install some additional libraries: libaio, libnuma, libhwloc, libatlas.
All of these libraries are optional, but any that are wanted must be installed
before compiling the code of FlashX.
libaio is required to take advantage of SSDs to scale computation to large datasets.
libnuma is required for machines with more than two processor sockets.
libhwloc is required to tune FlashX automatically to achieve the best speed on the given hardware.
libatlas is a faster BLAS implementation and can accelerate matrix multiplication in FlashR.
In Ubuntu, users can install the additional libraries as follows:
$ sudo apt-get install -y libnuma-dev libaio-dev libhwloc-dev
$ sudo apt-get install -y libatlas-base-dev
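Since all four libraries are optional, it can help to confirm which ones are present before compiling. The sketch below assumes a Debian-style system where dpkg-query is available. The helper name check_optional_libs is hypothetical and is not part of FlashX.

```shell
#!/bin/sh
# Hypothetical helper (not part of FlashX): report which of the optional
# development packages named above are installed. Assumes dpkg-query is
# available; if it is not, every package is reported as missing.
check_optional_libs() {
    for pkg in libaio-dev libnuma-dev libhwloc-dev libatlas-base-dev; do
        if dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null \
                | grep -q 'install ok installed'; then
            echo "$pkg: installed"
        else
            echo "$pkg: missing"
        fi
    done
}

check_optional_libs
```

Any package reported as missing can be installed with the apt-get commands above before building FlashX.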
Install FlashR & FlashGraphR from GitHub directly
FlashR and FlashGraphR are each hosted in their own GitHub repository. We can install FlashR from within R as follows:
> install.packages("Rcpp")
> install.packages("RSpectra")
> install.packages("https://github.com/flashxio/FlashR/releases/download/FlashR-latest/FlashR.tar.gz", repos=NULL)
Similarly, we can install FlashGraphR as follows:
> install.packages('igraph', repos = 'http://cran.rstudio.com/')
> install.packages("https://github.com/flashxio/FlashGraphR/releases/download/FlashGraphR-latest/FlashGraphR.tar.gz", repos=NULL)
**NOTE: FlashGraphR relies on FlashR. Please install FlashR before installing FlashGraphR.**
Install FlashR & FlashGraphR manually
Another way to install FlashR and FlashGraphR is to download the source from GitHub and install them manually. The benefit of this approach is that the installation process can be customized. For example, it allows us to compile the code in parallel.
$ git clone https://github.com/flashxio/FlashX.git
$ cd FlashX
$ mkdir -p build; cd build; cmake ../; make -j4; cd ..
$ R -e "install.packages('Rcpp', repos = 'http://cran.rstudio.com/')"
$ R -e "install.packages('RSpectra', repos = 'http://cran.rstudio.com/')"
$ R -e "install.packages('igraph', repos = 'http://cran.rstudio.com/')"
$ ./install_FlashR.sh
$ ./install_FlashGraphR.sh
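The make -j4 step above hardcodes four parallel jobs. As a sketch of the kind of customization a manual install allows, the job count can instead be derived from the number of online CPUs. The variable name njobs is illustrative; nproc (GNU coreutils) or getconf is assumed to be available, with a fallback to 4 otherwise.

```shell
#!/bin/sh
# Derive the parallel job count from the number of online CPUs.
# Falls back to 4 (the value used in the steps above) if neither tool works.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# Print the build command instead of running it, so the sketch can be tried
# without a FlashX checkout; run the printed command inside FlashX/build.
echo "make -j$njobs"
```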
Install FlashX in a Docker container
To install FlashR and FlashGraphR in a Docker container, clone the FlashX repository and follow the steps below.
$ git clone https://github.com/flashxio/FlashX.git
$ cd FlashX
$ docker build -t flashx docker
$ docker run -d flashx
$ docker exec -it <container id> bash
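The container ID needed by docker exec is what docker run -d prints on standard output, so it can be captured in a variable rather than copied by hand. This is a minimal sketch; the function name start_flashx_container is hypothetical, and the image tag flashx is the one built in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: start the flashx image in the background and print
# the container ID that `docker run -d` writes to standard output.
start_flashx_container() {
    docker run -d flashx
}

# Usage (requires Docker and the image built in the steps above):
#   cid=$(start_flashx_container)
#   docker exec -it "$cid" bash
```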
Run FlashR
FlashR is designed to optimize for different hardware. If FlashR is installed
with libhwloc, it adapts itself to different hardware automatically, from
a regular laptop (with a single processor) to a high-end server (with multiple
processors). For a machine with SSDs, FlashR can utilize the SSDs to scale
computation to very large datasets if libaio is installed.
Run FlashR in memory
If we run FlashR in memory and FlashR is installed with libhwloc, we do not
need to configure FlashR at all and all computation in FlashR is parallelized
automatically.
However, if FlashR is not installed with libhwloc, we can still maximize
the performance of FlashR by explicitly telling FlashR the number of processors
and the number of CPU cores in the machine. The important parameters in the
configuration file are num_nodes (the number of processors) and num_threads
(the number of threads); see here for an example of the configuration file.
For example, on a machine with 4 processors, each with 12 CPU cores, we should
set num_nodes=4 and num_threads=48. A complete list of FlashR parameters can be
found here. We pass the configuration file to fm.set.conf as follows:
> fm.set.conf("path/to/conf/file")
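As a worked example of the arithmetic above, num_threads is the number of processors multiplied by the number of cores per processor. The sketch below writes a two-line configuration file in the key=value form used in the text. The helper name write_flashr_conf and the file name flashr.conf are hypothetical, and a real configuration may need more parameters than these two.

```shell
#!/bin/sh
# Hypothetical helper (not part of FlashR): write num_nodes and num_threads
# to a configuration file.
# Usage: write_flashr_conf <processors> <cores per processor> <output file>
write_flashr_conf() {
    processors=$1
    cores_per_processor=$2
    conf_file=$3
    # One thread per CPU core across all processors.
    threads=$((processors * cores_per_processor))
    printf 'num_nodes=%s\nnum_threads=%s\n' "$processors" "$threads" > "$conf_file"
}

# The example from the text: 4 processors with 12 cores each.
write_flashr_conf 4 12 flashr.conf
cat flashr.conf
```

The resulting file contains num_nodes=4 and num_threads=48, and its path can then be passed to fm.set.conf.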
Run FlashR with SSDs
To run FlashR with SSDs, we need to specify the data directories for FlashR
with root_conf in the configuration file described above (see an
example config file). root_conf accepts the path to a text file or to a
directory. If a machine has only one SSD, or multiple SSDs connected with a
RAID controller, we can create a directory on the SSD(s) and give its path to
root_conf. For example, if the SSDs are mounted on /mnt/ssd and we want to
store FlashR data in /mnt/ssd/FlashR_data, we set
root_conf=/mnt/ssd/FlashR_data.
Please check here for more advanced configuration of a large SSD array on a large parallel machine.
NOTE: to run FlashR with SSDs, FlashR must be installed with libaio.
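The single-directory case above can be sketched as follows: create the data directory, then record it as root_conf in the configuration file. The helper name set_root_conf and the file name flashr.conf are hypothetical. The demo runs in a scratch directory so that it does not need an SSD mounted at /mnt/ssd.

```shell
#!/bin/sh
# Hypothetical helper (not part of FlashR): create the data directory and
# record it as root_conf in a configuration file, replacing any earlier
# root_conf line.
# Usage: set_root_conf <data directory> <configuration file>
set_root_conf() {
    data_dir=$1
    conf_file=$2
    mkdir -p "$data_dir" || return 1
    touch "$conf_file"
    # Keep every line except an existing root_conf entry.
    grep -v '^root_conf=' "$conf_file" > "$conf_file.tmp"
    mv "$conf_file.tmp" "$conf_file"
    printf 'root_conf=%s\n' "$data_dir" >> "$conf_file"
}

# Demo in a scratch directory; on the machine described above the first
# argument would be /mnt/ssd/FlashR_data.
demo_dir=$(mktemp -d)
set_root_conf "$demo_dir/FlashR_data" "$demo_dir/flashr.conf"
cat "$demo_dir/flashr.conf"
```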
Run an example in FlashR
FlashR implements the existing R matrix functions, so we can run existing R code with little modification. Here we show an example of creating a mixture of multivariate Gaussians and running k-means on it.
First, we create mvrnorm, adapted from mvrnorm in the MASS package, to generate
multivariate normal distributions. As shown here, we only need to modify two
small places to run the function in FlashR.
We use this function to create the function mix.mvrnorm, which constructs
a dataset from a mixture of Gaussian distributions. mix.mvrnorm creates m
normal distributions with different means and diagonal covariance matrices and
combines them to construct a dataset.
We run k-means on the dataset to cluster the data points into 10 clusters.
fm.kmeans returns the cluster assignments as a vector (res$cluster), each of
whose elements indicates the cluster id of a data point.
We run fm.table to count the number of data points in each cluster.
library(FlashR)
mvrnorm <-
    function(n = 1, mu, Sigma, tol=1e-6, empirical = FALSE, EISPACK = FALSE)
{
    p <- length(mu)
    if(!all(dim(Sigma) == c(p,p))) stop("incompatible arguments")
    if(EISPACK) stop("'EISPACK' is no longer supported by R", domain = NA)
    eS <- eigen(Sigma, symmetric = TRUE)
    ev <- eS$values
    if(!all(ev >= -tol*abs(ev[1L]))) stop("'Sigma' is not positive definite")
    # FlashR change 1: MASS uses matrix(rnorm(p * n), n) here
    X <- fm.rnorm.matrix(n, p)
    if(empirical) {
        X <- scale(X, TRUE, FALSE) # remove means
        # FlashR change 2: MASS uses svd() here
        X <- X %*% fm.svd(X, nu = 0)$v # rotate to PCs
        X <- scale(X, FALSE, TRUE) # rescale PCs to unit variance
    }
    X <- drop(mu) + eS$vectors %*% diag(sqrt(pmax(ev, 0)), p) %*% t(X)
    nm <- names(mu)
    if(is.null(nm) && !is.null(dn <- dimnames(Sigma))) nm <- dn[[1L]]
    dimnames(X) <- list(nm, NULL)
    if(n == 1) drop(X) else t(X)
}
mix.mvrnorm <- function(n, p, m)
{
    mats <- list()
    for (i in 1:m)
        mats <- c(mats, mvrnorm(n, runif(p), diag(runif(p))))
    fm.rbind.list(mats)
}
> mat <- mix.mvrnorm(1000000, 10, 10)
> res <- fm.kmeans(mat, 10)
> cnt <- fm.table(res$cluster)
> as.vector(cnt@val)
[1] 1 2 3 4 5 6 7 8 9 10
> as.vector(cnt@Freq)
[1] 914957 1000803 982197 1058306 907314 957551 1060443 1101763 1065113
[10] 951553
Run FlashGraphR
Users can run graph algorithms provided by FlashGraphR.
Users can load a graph in either the text edge list format or the FlashGraph format. If users provide a text edge list, FlashR constructs the FlashGraph format directly. For example, both of the following commands load the wiki-Vote graph into FlashR (assuming the text edge list and the FlashGraph image are both in the current directory).
> g <- fg.load.graph("./wiki-Vote.txt", directed=TRUE)
> g <- fg.load.graph("./wiki-Vote.adj", "./wiki-Vote.index")
The following example runs PageRank.
> library(FlashGraphR)
> fg.set.conf("flash-graph/conf/run_test.txt")
> g <- fg.load.graph("./wiki-Vote.txt", directed=TRUE)
> res <- fg.page.rank(g)
> res <- sort(as.vector(res), decreasing=FALSE, index.return=TRUE)
> tail(res$x, n=10)
[1] 6.354546 6.411633 6.703026 7.396998 7.483035 7.702387 9.680690
[8] 10.584648 10.879063 13.640081
> tail(res$ix, n=10)-1
[1] 5254 7553 4191 2237 2470 2398 2625 6634 15 4037
Launch FlashX Jupyter Notebook in EC2
To launch a FlashX Jupyter Notebook in EC2, a user first needs to install the boto3 library:
$ pip install boto3
Next, set up credentials (in e.g. ~/.aws/credentials):
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
After setting up boto3, a user can run the script to launch a FlashX Jupyter Notebook.
$ python create_instance.py
The script prints the IP address of the EC2 instance. Access the Jupyter Notebook at http://ec2_ip:8888, replacing ec2_ip with that address. NOTE: an EC2 instance may take a few minutes to finish setting up, so wait a few minutes before accessing the Notebook.
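Rather than guessing how long to wait, the Notebook URL can be polled until it answers. The following is a minimal sketch that assumes curl is installed; the function name wait_for_notebook and its default of 30 attempts spaced 10 seconds apart are illustrative choices, not part of FlashX.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it responds or the attempts run out.
# Usage: wait_for_notebook <url> [attempts] [seconds between attempts]
# Returns 0 once the URL responds, 1 otherwise. Assumes curl is installed.
wait_for_notebook() {
    url=$1
    attempts=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: quiet, -o /dev/null: discard the body, --max-time: per-try limit.
        if curl -s -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage, with ec2_ip replaced by the address printed by the script:
#   wait_for_notebook "http://ec2_ip:8888" && echo "Notebook is up"
```

Any HTTP response counts as success here, including a redirect to a login page, since the goal is only to detect that the server has started.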