How to contribute to FlashX

TODO list for the R interface

TODO list for the Python interface

TODO list for additional graph analysis algorithms

  • Louvain clustering
  • very efficient shortest path search

TODO list for additional machine learning algorithms

  • SVM
  • elastic nets
  • random forest
  • gradient boosting
  • isomap and many other manifold learning algorithms
  • stochastic gradient descent
  • deep neural networks

Explore more advanced/experimental machine learning algorithms

  • Randomized algorithms: Although FlashX supports very large datasets by storing data on SSDs, computation or the I/O bandwidth between SSDs and CPUs can still be the bottleneck. Randomized algorithms can reduce computation, data movement, or both, and so improve performance significantly while retaining similar accuracy. Examples are randomized SVD and the randomized Newton method.
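
To make the idea of trading a little accuracy for much less data movement concrete, here is a minimal sketch of randomized SVD using a random range finder with a few power iterations. It is written against NumPy rather than FlashX; the function name randomized_svd and its parameters (oversample, n_power_iter, seed) are illustrative assumptions, not part of any FlashX API.

```python
import numpy as np


def randomized_svd(a, rank, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top `rank` singular triplets of `a`.

    Hypothetical helper for illustration only. The input matrix is read a
    small, fixed number of times, which is the property that matters when
    the matrix lives on SSDs.
    """
    rng = np.random.default_rng(seed)
    m, n = a.shape
    k = min(rank + oversample, m, n)

    # Project onto a random k-dimensional subspace (one pass over `a`).
    omega = rng.standard_normal((n, k))
    q, _ = np.linalg.qr(a @ omega)

    # Power iterations sharpen the subspace when singular values decay
    # slowly; each iteration costs two more passes over `a`.
    for _ in range(n_power_iter):
        z, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ z)

    # Solve the small k-by-n problem exactly, then map back.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]


# Usage: recover an exactly rank-5 matrix.
data_rng = np.random.default_rng(1)
low_rank = data_rng.standard_normal((200, 5)) @ data_rng.standard_normal((5, 80))
u, s, vt = randomized_svd(low_rank, rank=5)
rel_err = np.linalg.norm(low_rank - (u * s) @ vt) / np.linalg.norm(low_rank)
```

In this sketch the expensive operations are the products with a and its transpose; everything else works on matrices with only k columns or rows, which is why the approach suits data that does not fit in memory.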

TODO list for the FlashX framework

  • Implement an R compiler for the user-defined functions (UDFs) passed to generalized matrix operations. Generalized matrix operations rely on UDFs to perform the actual computation on matrix elements. Currently, these UDFs have to be implemented in C++. We need to let users write them in R and compile them into a low-level representation that runs in the underlying system of FlashR. The goal is to make the FlashR framework more general while keeping performance close to that of the efficient C++ implementations.
  • Test FlashX on different Linux distributions.
  • Port FlashX to Mac and Windows.
  • Implement external-memory parallel sorting.
  • Optimize most matrix operations on sparse matrices.
  • Re-implement libaio functions with non-blocking I/O.
  • Implement data loading functions for different data sources (local files, S3, URLs) and different formats (CSV, binary, CSR, edge list, GraphML, etc.).
  • Measure the performance of FlashX on different hardware (laptops/desktops/servers with RAM/disks/SSDs/NVM, cloud instances with SSDs).
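
As an illustration of what a generalized matrix operation with user-defined functions means in the R compiler item above, here is a minimal pure-Python sketch of a generalized inner product. The name generalized_inner_prod and its signature are hypothetical and do not reflect the actual FlashX or FlashR interface; the sketch only shows how swapping the two element-wise functions changes the operation being computed.

```python
import operator
from functools import reduce


def generalized_inner_prod(a, b, combine, accumulate):
    """Matrix product where multiply and add are replaced by UDFs.

    Hypothetical helper for illustration. `a` and `b` are lists of rows.
    `combine` pairs one element from a row of `a` with one element from a
    column of `b`; `accumulate` folds the combined values into one result.
    """
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    cols_b = list(zip(*b))  # columns of b
    return [
        [
            reduce(accumulate, (combine(x, y) for x, y in zip(row, col)))
            for col in cols_b
        ]
        for row in a
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Ordinary matrix multiplication: combine = multiply, accumulate = add.
standard = generalized_inner_prod(a, b, operator.mul, operator.add)

# Min-plus product, the building block of some shortest-path algorithms:
# combine = add, accumulate = min.
min_plus = generalized_inner_prod(a, b, operator.add, min)
```

Here the UDFs are ordinary Python callables invoked once per element, which is simple but slow. That per-element overhead is the reason the item above asks for UDFs written in a high-level language to be compiled into a low-level representation.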