FlyNN: Fly Nearest Neighbor Classifier

February 18, 2022 · View on GitHub

Code for the "Federated Nearest Neighbor Classification with a Colony of Fruit-Flies" paper appearing at the AAAI 2022 conference on Artificial Intelligence. The camera ready version is available here (extended version with supplement is here). This slide deck describes our paper.

Table of contents:

Setting up environment
- Prerequisites
- Installing requirements
Running experiments
Citation

Setting up environment

This section details the setup of the compute environment for executing the provided scripts.

Prerequisites

python3.8
pip
virtualenv

Installing requirements

$ mkdir flynn
$ virtualenv -p /usr/bin/python3.8 flynn
$ source flynn/bin/activate
(flynn) $ export PYTHONPATH=`pwd`
(flynn) $ pip install --upgrade pip
(flynn) $ pip install -r requirements.txt

Running experiments

This section provides the precise commandline arguments for the various scripts to generate results for the different experiments conducted.

Running experiments on hyper-parameter dependence

Evaluation script options

(flynn) $ python test/eval_hpdep_kfold_small_data.py --help
usage: eval_hpdep_kfold_small_data.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-e EXP_FACTOR_LB]
                                      [-E EXP_FACTOR_UB] [-s CONN_SPAR_LB] [-S CONN_SPAR_UB]
                                      [-w WTA_NNZ_LB] [-W WTA_NNZ_UB] [-g GAMMA_LB] [-G GAMMA_UB]
                                      [-H {ef,cs,wn,gamma}] [-n NVALS_FOR_HP]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -e EXP_FACTOR_LB, --exp_factor_lb EXP_FACTOR_LB
                        Lower bound on the expansion factor HP
  -E EXP_FACTOR_UB, --exp_factor_ub EXP_FACTOR_UB
                        Upper bound on the expansion factor HP
  -s CONN_SPAR_LB, --conn_spar_lb CONN_SPAR_LB
                        Lower bound on the connection sparsity HP
  -S CONN_SPAR_UB, --conn_spar_ub CONN_SPAR_UB
                        Upper bound on the connection sparsity HP
  -w WTA_NNZ_LB, --wta_nnz_lb WTA_NNZ_LB
                        Lower bound on the winner-take-all NNZ HP
  -W WTA_NNZ_UB, --wta_nnz_ub WTA_NNZ_UB
                        Upper bound on the winner-take-all NNZ HP
  -g GAMMA_LB, --gamma_lb GAMMA_LB
                        Lower bound for 'gamma' in the bloom filter
  -G GAMMA_UB, --gamma_ub GAMMA_UB
                        Upper bound for 'gamma' in the bloom filter
  -H {ef,cs,wn,gamma}, --hp2tune {ef,cs,wn,gamma}
                        The hyper-parameter to tune while fixing the rest
  -n NVALS_FOR_HP, --nvals_for_hp NVALS_FOR_HP
                        The number of values to try for the hyper-parameter

Exact commandline arguments

Set <NTHREADS> as per the resources (number of threads, memory) available.

FlyHash dimension `m`

(flynn) $ python test/eval_hpdep_kfold_small_data.py -t <NTHREADS> -F 10 -e 4 -E 2048 -s 0.1 -S 0.3 -w 8 -W 32 -g 0. -G 0.5 -H ef -n 10

Projection density `s`

(flynn) $ python test/eval_hpdep_kfold_small_data.py -t <NTHREADS> -F 10 -e 256 -E 1024 -s 0.1 -S 0.8 -w 8 -W 32 -g 0. -G 0.5 -H cs -n 10

FlyHash NNZ `\rho`

(flynn) $ python test/eval_hpdep_kfold_small_data.py -t <NTHREADS> -F 10 -e 256 -E 1024 -s 0.1 -S 0.3 -w 4 -W 256 -g 0. -G 0.5 -H wn -n 10

FlyNN decay rate `\gamma`

(flynn) $ python test/eval_hpdep_kfold_small_data.py -t <NTHREADS> -F 10 -e 256 -E 1024 -s 0.1 -S 0.3 -w 8 -W 32 -g 0.1 -G 0.8 -H gamma -n 10

Running comparison to baselines with synthetic data

Set the <NTHREADS> based on the compute resources available.

Synthetic binary data

We will be trying the following configurations:

<NDIMS> = 100, <NNZ_PER_ROW> = 10, <NSAMPLES> = 1000
<NDIMS> = 100, <NNZ_PER_ROW> = 20, <NSAMPLES> = 1000
<NDIMS> = 100, <NNZ_PER_ROW> = 30, <NSAMPLES> = 1000
<NDIMS> = 100, <NNZ_PER_ROW> = 40, <NSAMPLES> = 1000

Evaluation script options

(flynn) $ python test/eval_synthetic_binary_data.py --help
usage: eval_synthetic_binary_data.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-c MAX_CALLS]
                                     [-r N_RANDOM_STARTS] [-f EXP_FACTOR_UB]
                                     [-s CONN_SPAR_UB] [-w WTA_NNZ_UB] [-S N_ROWS]
                                     [-d N_COLS] [-L N_CLASSES] [-C N_CLUSTERS_PER_CLASS]
                                     [-W NNZ_PER_ROW] [-R N_REPS]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -c MAX_CALLS, --max_calls MAX_CALLS
                        Maximum number of calls for GP
  -r N_RANDOM_STARTS, --n_random_starts N_RANDOM_STARTS
                        Number of random start points in GP
  -f EXP_FACTOR_UB, --exp_factor_ub EXP_FACTOR_UB
                        Upper bound on the expansion factor HP
  -s CONN_SPAR_UB, --conn_spar_ub CONN_SPAR_UB
                        Upper bound on the connection sparsity HP
  -w WTA_NNZ_UB, --wta_nnz_ub WTA_NNZ_UB
                        Upper bound on the winner-take-all ratio HP
  -S N_ROWS, --n_rows N_ROWS
                        Number of rows in the data set
  -d N_COLS, --n_cols N_COLS
                        Number of columns in the data set
  -L N_CLASSES, --n_classes N_CLASSES
                        Number of classes in the data set
  -C N_CLUSTERS_PER_CLASS, --n_clusters_per_class N_CLUSTERS_PER_CLASS
                        Number of clusters per class
  -W NNZ_PER_ROW, --nnz_per_row NNZ_PER_ROW
                        Number of NNZ per row in the binary data set
  -R N_REPS, --n_reps N_REPS
                        Number of repetitions

Exact commandline arguments

(flynn) $ python test/eval_synthetic_binary_data.py -t <NTHREADS> -F 10 -c 60 -r 15 -f 2048.0 -s 0.5 -w 256 -S <NSAMPLES> -d <NDIMS> -L 5 -C 3 -W <NNZ_PER_ROW> -R 30

Synthetic continuous data

We will be trying the following configurations:

<NDIMS> = 100, <NSAMPLES> = 1000

Evaluation script options

(flynn) $ python test/eval_synthetic_data.py --help
usage: eval_synthetic_data.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-c MAX_CALLS]
                              [-r N_RANDOM_STARTS] [-f EXP_FACTOR_UB] [-s CONN_SPAR_UB]
                              [-w WTA_NNZ_UB] [-S N_ROWS] [-d N_COLS] [-L N_CLASSES]
                              [-C N_CLUSTERS_PER_CLASS] [-R N_REPS]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -c MAX_CALLS, --max_calls MAX_CALLS
                        Maximum number of calls for GP
  -r N_RANDOM_STARTS, --n_random_starts N_RANDOM_STARTS
                        Number of random start points in GP
  -f EXP_FACTOR_UB, --exp_factor_ub EXP_FACTOR_UB
                        Upper bound on the expansion factor HP
  -s CONN_SPAR_UB, --conn_spar_ub CONN_SPAR_UB
                        Upper bound on the connection sparsity HP
  -w WTA_NNZ_UB, --wta_nnz_ub WTA_NNZ_UB
                        Upper bound on the winner-take-all ratio HP
  -S N_ROWS, --n_rows N_ROWS
                        Number of rows in the data set
  -d N_COLS, --n_cols N_COLS
                        Number of columns in the data set
  -L N_CLASSES, --n_classes N_CLASSES
                        Number of classes in the data set
  -C N_CLUSTERS_PER_CLASS, --n_clusters_per_class N_CLUSTERS_PER_CLASS
                        Number of clusters per class
  -R N_REPS, --n_reps N_REPS
                        Number of repetitions

Exact commandline arguments

(flynn) $ python test/eval_synthetic_data.py -t <NTHREADS> -F 10 -c 60 -r 15 -f 2048.0 -s 0.5 -w 256 -S <NSAMPLES> -d <NDIMS> -L 5 -C 3 -R 30

Running comparison to baselines with OpenML data

OpenML data sets:

<MIN_DATA_DIM> = 20, <MAX_DATA_DIM> = 1000, <MAX_DATA_SAMPLES> = 20000

Based on resources available:

Set --n_parallel/-t <NTHREADS> based on number of threads <NTHREADS> available to process that many data sets in parallel.

Running `kNNC` baseline

Evaluation script options

(flynn) $ python test/knn_baseline.py --help
usage: knn_baseline.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-n MIN_DATA_DIM]
                       [-x MAX_DATA_DIM] [-S MAX_DATA_SAMPLES] [-K MAX_K]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -n MIN_DATA_DIM, --min_data_dim MIN_DATA_DIM
                        Minimum data dimensionality on OpenML
  -x MAX_DATA_DIM, --max_data_dim MAX_DATA_DIM
                        Maximum data dimensionality on OpenML
  -S MAX_DATA_SAMPLES, --max_data_samples MAX_DATA_SAMPLES
                        Maximum number of samples in data on OpenML
  -K MAX_K, --max_k MAX_K
                        Maximum k for kNNC

Exact commandline arguments

(flynn) $ python test/knn_baseline.py -t <NTHREADS> -F 10 -n <MIN_DATA_DIM> -x <MAX_DATA_DIM> -S <MAX_DATA_SAMPLES> -K 1
(flynn) $ python test/knn_baseline.py -t <NTHREADS> -F 10 -n <MIN_DATA_DIM> -x <MAX_DATA_DIM> -S <MAX_DATA_SAMPLES>

Running `SBFC` baseline

Evaluation script options

(flynn) $ python test/sbf_baseline.py --help
usage: sbf_baseline.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-n MIN_DATA_DIM]
                       [-x MAX_DATA_DIM] [-S MAX_DATA_SAMPLES]
                       [-E EXPANSION_FACTOR_UB]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -n MIN_DATA_DIM, --min_data_dim MIN_DATA_DIM
                        Minimum data dimensionality on OpenML
  -x MAX_DATA_DIM, --max_data_dim MAX_DATA_DIM
                        Maximum data dimensionality on OpenML
  -S MAX_DATA_SAMPLES, --max_data_samples MAX_DATA_SAMPLES
                        Maximum number of samples in data on OpenML
  -E EXPANSION_FACTOR_UB, --expansion_factor_ub EXPANSION_FACTOR_UB
                        Upper bound on the factor with which to project up

Exact commandline arguments

(flynn) $ python test/sbf_baseline.py -t <NTHREADS> -F 10 -n <MIN_DATA_DIM> -x <MAX_DATA_DIM> -S <MAX_DATA_SAMPLES> -E 2048.0

Running `FlyNN`

Evaluation script options

(flynn) $ python test/flynn_hpo.py --help
usage: flynn_hpo.py [-h] [-t N_PARALLEL] [-F N_FOLDS] [-c MAX_CALLS] [-r N_RANDOM_STARTS]
                    [-f EXP_FACTOR_UB] [-s CONN_SPAR_UB] [-w WTA_NNZ_UB] [-D DONE_SET_RE] [-N NON_BINARY]
                    [-n MIN_DATA_DIM] [-x MAX_DATA_DIM] [-S MAX_DATA_SAMPLES] [-B MAX_BATCH_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARALLEL, --n_parallel N_PARALLEL
                        Number of parallel workers
  -F N_FOLDS, --n_folds N_FOLDS
                        Number of folds
  -c MAX_CALLS, --max_calls MAX_CALLS
                        Maximum number of calls for GP
  -r N_RANDOM_STARTS, --n_random_starts N_RANDOM_STARTS
                        Number of random start points in GP
  -f EXP_FACTOR_UB, --exp_factor_ub EXP_FACTOR_UB
                        Upper bound on the expansion factor HP
  -s CONN_SPAR_UB, --conn_spar_ub CONN_SPAR_UB
                        Upper bound on the projection density HP
  -w WTA_NNZ_UB, --wta_nnz_ub WTA_NNZ_UB
                        Upper bound on the winner-take-all NNZ HP
  -D DONE_SET_RE, --done_set_re DONE_SET_RE
                        Regex for the files corresponding to datasets already processed
  -N NON_BINARY, --non_binary NON_BINARY
                        Whether to use gamma=0 (binary, default) or gamma>0 (non-binary)
  -n MIN_DATA_DIM, --min_data_dim MIN_DATA_DIM
                        Minimum data dimensionality on OpenML
  -x MAX_DATA_DIM, --max_data_dim MAX_DATA_DIM
                        Maximum data dimensionality on OpenML
  -S MAX_DATA_SAMPLES, --max_data_samples MAX_DATA_SAMPLES
                        Maximum number of samples in data on OpenML
  -B MAX_BATCH_SIZE, --max_batch_size MAX_BATCH_SIZE
                        Maximum batch size for FlyNN training

Set --max_batch_size/-B <MAX_BATCH_SIZE> based on amount of memory available; large values lead to larger memory overheads but faster execution times.

Exact commandline arguments for `FlyNN, \gamma=0`

(flynn) $ python test/flynn_hpo.py -t <NTHREADS> -F 10 -c 60 -r 15 -f 2048.0 -s 0.5 -w 256 -n <MIN_DATA_DIM> -x <MAX_DATA_DIM> -S <MAX_DATA_SAMPLES>

Exact commandline arguments for `FlyNN, \gamma > 0`

Set --non_binary/-N to True to evaluate (non-binary) FlyNN; by default, the script evaluates (binary) FlyNN respectively. Specifically

(flynn) $ python test/flynn_hpo.py -t <NTHREADS> -F 10 -c 60 -r 15 -f 2048.0 -s 0.5 -w 256 -n <MIN_DATA_DIM> -x <MAX_DATA_DIM> -S <MAX_DATA_SAMPLES> -N True

Scaling with number of parties

Evaluation script options

(flynn) $ python test/scaling_nparties.py -h
usage: scaling_nparties.py [-h] [-t N_PARTIES] [-B MAX_BATCH_SIZE] [-R N_REPS]

optional arguments:
  -h, --help            show this help message and exit
  -t N_PARTIES, --n_parties N_PARTIES
                        Maximum number of parties
  -B MAX_BATCH_SIZE, --max_batch_size MAX_BATCH_SIZE
                        Maximum batch size
  -R N_REPS, --n_reps N_REPS
                        Number of repetitions

Set --max_batch_size/-B <MAX_BATCH_SIZE> based on amount of memory available; large values lead to larger memory overheads but faster execution times.

Exact commandline arguments

(flynn) $ python test/scaling_nparties.py -t 16 -B <MAX_BATCH_SIZE> -R 10

Differential privacy effect

Synthetic data

Evaluation script options

(flynn) $ python test/test_dp_flynn_syn.py --help
usage: test_dp_flynn_syn.py [-h] [-t NTHREADS] [-S NROWS] [-d NCOLS] [-L NCLASSES] [-C NCLUSTERS_PER_CLASS]
                            [-R NREPS] [-f EXP_FACTOR] [-k WTA_RATIO] [-c CONN_SPARSITY] [-r {geom,lin}]
                            [-T STEP_SIZE_FOR_T] [-e {non-private,all}]

optional arguments:
  -h, --help            show this help message and exit
  -t NTHREADS, --nthreads NTHREADS
                        Number of parallel workers
  -S NROWS, --nrows NROWS
                        Number of rows in the data set
  -d NCOLS, --ncols NCOLS
                        Number of columns in the data set
  -L NCLASSES, --nclasses NCLASSES
                        Number of classes in the data set
  -C NCLUSTERS_PER_CLASS, --nclusters_per_class NCLUSTERS_PER_CLASS
                        Number of clusters per class
  -R NREPS, --nreps NREPS
                        Number of repetitions
  -f EXP_FACTOR, --exp_factor EXP_FACTOR
                        the expansion factor HP
  -k WTA_RATIO, --wta_ratio WTA_RATIO
                        the WTA ratio
  -c CONN_SPARSITY, --conn_sparsity CONN_SPARSITY
                        the connection sparsity
  -r {geom,lin}, --rounds_option {geom,lin}
                        Values of #rounds to try
  -T STEP_SIZE_FOR_T, --step_size_for_T STEP_SIZE_FOR_T
                        Step size for T values
  -e {non-private,all}, --evaluation {non-private,all}
                        What methods to evaluate

Exact commandline arguments

We utilize all following

= 10000, 100000
= 30
= 10, 20
= 0.1
= 0.05, 0.1

(flynn) $ python test/test_dp_flynn_syn.py -t 2 -S <NROWS> -d <NCOLS> -L 2 -C 5 -R 10 -f <EF> -k <WR> -c <CS> -r lin -T 50 -e all

MNIST 3v8

Evaluation script options

(flynn) $ python test/test_dp_flynn_real.py --help
usage: test_dp_flynn_real.py [-h] [-t NTHREADS]
                             [-d {mnist,cifar10,fashion_mnist,cifar100,higgs,numerai28.6,connect-4,APSFailure}]
                             -L LABEL LABEL [-R NREPS] [-f EXP_FACTOR] [-k WTA_RATIO] [-c CONN_SPARSITY]
                             [-r {geom,lin}] [-T STEP_SIZE_FOR_T] [-e {non-private,all}]

optional arguments:
  -h, --help            show this help message and exit
  -t NTHREADS, --nthreads NTHREADS
                        Number of parallel workers
  -d {mnist,cifar10,fashion_mnist,cifar100,higgs,numerai28.6,connect-4,APSFailure}, --data_name {mnist,cifar10,fashion_mnist,cifar100,higgs,numerai28.6,connect-4,APSFailure}
                        The data set to evaluate on
  -L LABEL LABEL, --label LABEL LABEL
                        The label pair to extract to analyze a binary problem (use -1 for all labels)
  -R NREPS, --nreps NREPS
                        Number of repetitions
  -f EXP_FACTOR, --exp_factor EXP_FACTOR
                        the expansion factor HP
  -k WTA_RATIO, --wta_ratio WTA_RATIO
                        the WTA ratio
  -c CONN_SPARSITY, --conn_sparsity CONN_SPARSITY
                        the connection sparsity
  -r {geom,lin}, --rounds_option {geom,lin}
                        Values of #rounds to try
  -T STEP_SIZE_FOR_T, --step_size_for_T STEP_SIZE_FOR_T
                        Step size for T values
  -e {non-private,all}, --evaluation {non-private,all}
                        What methods to evaluate

Exact commandline arguments

We utilize all following

= 30
= 0.00625
= 0.1

(flynn) $ python test/test_dp_flynn_real.py -t 2 -d mnist -L 3 8 -R 10 -f <EF> -k <WR> -c <CS> -r lin -T 50 -e all

Citation

Please use the following citation for the paper:

Ram, Parikshit and Sinha, Kaushik. "Federated Nearest Neighbor Classification with a Colony of Fruit-Flies." To appear in the Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022.

@inproceedings{ram2022federated,
  title={Federated Nearest Neighbor Classification with a Colony of Fruit-Flies},
  author={Ram, Parikshit and Sinha, Kaushik},
  booktitle={To appear in Proceedings of the 36th AAAI Conference on Artificial Intelligence},
  year={2022}
}