January 13, 2020


rapidsai/cuml

cuML - RAPIDS Machine Learning Library


repo name	rapidsai/cuml
repo link	https://github.com/rapidsai/cuml
homepage
language	C++
size (curr.)	49300 kB
stars (curr.)	1166
created	2018-10-11
license	Apache License 2.0

cuML - GPU Machine Learning Algorithms

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions, and that share compatible APIs with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML’s Python API matches the API from scikit-learn.

For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.

As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU:

import cudf
from cuml.cluster import DBSCAN

# Create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)

print(dbscan_float.labels_)

Output:

0    0
1    1
2    2
dtype: int32
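The labels above can be checked by hand. With min_samples=1, every sample is a core point, so two samples share a cluster only if a chain of neighbors within eps connects them. The following is a minimal CPU-only sketch in plain Python (it does not use cuML and says nothing about how cuML implements DBSCAN) that looks for any pair of rows within eps=1.0 of each other:

```python
import math

# The three rows of the example DataFrame, one tuple per sample
# (columns '0', '1', '2' read across).
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
eps = 1.0

# Collect every pair of samples whose Euclidean distance is within eps
close_pairs = [
    (i, j)
    for i in range(len(points))
    for j in range(i + 1, len(points))
    if math.dist(points[i], points[j]) <= eps
]

print(close_pairs)  # prints []
```

No pair is within eps, so each of the three samples forms its own cluster, which matches the labels 0, 1, 2.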

cuML also supports multi-GPU and multi-node multi-GPU operation, using Dask, for a growing list of algorithms. The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:

# Create a Dask CUDA cluster w/ one worker per device
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

# Connect a Dask client so the work below is scheduled on this cluster
from dask.distributed import Client
client = Client(cluster)

# Read CSV file in parallel across workers
import dask_cudf
df = dask_cudf.read_csv("/path/to/csv")

# Fit a NearestNeighbors model and query it
from cuml.dask.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10)
nn.fit(df)
neighbors = nn.kneighbors(df)
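For readers unfamiliar with the query itself, here is a small CPU-only reference of what a k-nearest-neighbors lookup computes. This is a brute-force sketch in plain Python, not the cuML or Faiss implementation; the helper name `brute_force_kneighbors` and the sample data are hypothetical and chosen only for illustration.

```python
import math


def brute_force_kneighbors(data, queries, n_neighbors):
    """Return (distances, indices) of the n_neighbors closest rows of data for each query."""
    all_distances, all_indices = [], []
    for q in queries:
        # Distance from this query to every stored row, paired with the row index,
        # sorted ascending and truncated to the closest n_neighbors entries.
        ranked = sorted((math.dist(q, row), i) for i, row in enumerate(data))[:n_neighbors]
        all_distances.append([d for d, _ in ranked])
        all_indices.append([i for _, i in ranked])
    return all_distances, all_indices


# Hypothetical 2-D samples; the same rows serve as both the index and the queries.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
distances, indices = brute_force_kneighbors(data, data, n_neighbors=2)

print(indices)  # prints [[0, 1], [1, 0], [2, 0], [3, 2]]
```

Because each row is queried against a set that contains it, the first neighbor of every row is the row itself at distance 0. The brute-force scan costs one distance computation per (query, row) pair, which is the kind of work that benefits from being spread across GPUs.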

For additional examples, browse our complete API documentation, or check out our introductory walkthrough notebooks. Finally, you can find complete end-to-end examples in the notebooks-contrib repo.

Supported Algorithms

Category	Algorithm	Notes
Clustering	Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
	K-Means	Multi-node multi-GPU via Dask
Dimensionality Reduction	Principal Components Analysis (PCA)	Multi-node multi-GPU via Dask
	Truncated Singular Value Decomposition (tSVD)	Multi-node multi-GPU via Dask
	Uniform Manifold Approximation and Projection (UMAP)
	Random Projection
	t-Distributed Stochastic Neighbor Embedding (TSNE)
Linear Models for Regression or Classification	Linear Regression (OLS)
	Linear Regression with Lasso or Ridge Regularization
	ElasticNet Regression
	Logistic Regression
	Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models
Nonlinear Models for Regression or Classification	Random Forest (RF) Classification	Experimental multi-node multi-GPU via Dask
	Random Forest (RF) Regression	Experimental multi-node multi-GPU via Dask
	Inference for decision tree-based models	Forest Inference Library (FIL)
	K-Nearest Neighbors (KNN)	Multi-node multi-GPU via Dask, uses Faiss for Nearest Neighbors Query.
	K-Nearest Neighbors (KNN) Classification
	K-Nearest Neighbors (KNN) Regression
	Support Vector Machine Classifier (SVC)
	Epsilon-Support Vector Regression (SVR)
Time Series	Linear Kalman Filter
	Holt-Winters Exponential Smoothing
	Auto-regressive Integrated Moving Average (ARIMA)	Supports seasonality (SARIMA)

Installation

See the RAPIDS Release Selector for the command line to install either nightly or official release cuML packages via Conda or Docker.

Build/Install from Source

See the build guide.

Contributing

Please see our guide for contributing to cuML.

Contact

Find out more details on the RAPIDS site.

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
