Class-based API
AutoGOAL's class-based API lets you automatically find optimal instances of complex objects in user-defined class hierarchies that solve a given task. A task is simply some method that evaluates an object's performance. The solution space is defined by a class hierarchy: all possible ways of combining instances of different types and of creating them with different parameters.
Note
The following code requires sklearn
dependencies. Read the dependencies section for more information.
For example, suppose we want to build the best possible classifier in scikit-learn
for a given dataset.
Let's begin with a simple classification problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
We use make_classification
to create a toy classification problem.
X, y = make_classification(random_state=0) # Fixed seed for reproducibility
A first idea is to use a specific algorithm, such as logistic regression, to solve this problem. Since the nature of these problems is stochastic, we need to train on one subset, test on another, and perform a sensible number of evaluations to actually know whether the result is any good.
from sklearn.linear_model import LogisticRegression
def evaluate(estimator, iters=30):
    scores = []
    for i in range(iters):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
        estimator.fit(X_train, y_train)
        scores.append(estimator.score(X_test, y_test))
    return sum(scores) / len(scores)
lr = LogisticRegression()
score = evaluate(lr) # around 0.83
print(score)
So far so good, but maybe we could do better with a different set of parameters. Logistic regression has at least two parameters that heavily influence its performance: the penalty function and the regularization strength.
Instead of writing a loop through a bunch of different parameters, we can use AutoGOAL to automatically explore the space of possible combinations. We can do this with the class-based API by providing annotations for the parameters we want to explore.
Trying multiple logistic regressions
First we import some annotation types from AutoGOAL
from autogoal.grammar import ContinuousValue, CategoricalValue
Next we annotate the parameters we want to explore.
Since we cannot modify the class LogisticRegression
we will inherit from it.
class LR(LogisticRegression):
    def __init__(
        self, penalty: CategoricalValue("l1", "l2"), C: ContinuousValue(0.1, 10)
    ):
        super().__init__(penalty=penalty, C=C, solver="liblinear")
The penalty: CategoricalValue("l1", "l2")
annotation tells AutoGOAL that for this class the
parameter penalty
can take values from a list of predefined values. Likewise, the
C: ContinuousValue(0.1, 10)
annotation indicates that the parameter C
can take a float value in a specified range.
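To make the idea of annotated parameters more concrete, here is a minimal sketch of how annotation objects attached to a constructor can be read back with the standard inspect module. The Categorical and Continuous classes, the Model class, and the describe helper are hypothetical stand-ins written for this illustration; they are not AutoGOAL's classes, and this is not necessarily how AutoGOAL processes annotations internally.

```python
import inspect


class Categorical:
    """Hypothetical stand-in for an annotation listing allowed options."""

    def __init__(self, *options):
        self.options = list(options)


class Continuous:
    """Hypothetical stand-in for an annotation giving a float range."""

    def __init__(self, low, high):
        self.low, self.high = low, high


class Model:
    # The annotations are ordinary Python objects attached to the signature.
    def __init__(self, penalty: Categorical("l1", "l2"), C: Continuous(0.1, 10)):
        self.penalty, self.C = penalty, C


def describe(cls):
    """Map each annotated constructor parameter name to its annotation object."""
    params = inspect.signature(cls.__init__).parameters
    return {
        name: p.annotation
        for name, p in params.items()
        if name != "self" and p.annotation is not inspect.Parameter.empty
    }


space = describe(Model)
print(sorted(space))             # ['C', 'penalty']
print(space["penalty"].options)  # ['l1', 'l2']
```

The point of the sketch is only that the search space lives in the signature itself, so a tool can discover it without any extra configuration file.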
Now we will use AutoGOAL to automatically generate different instances of our LR
class.
With the class-based API we achieve this by building a context-free grammar that describes all possible instances.
from autogoal.grammar import generate_cfg
grammar = generate_cfg(LR)
print(grammar)
<LR> := LR (penalty=<LR_penalty>, C=<LR_C>)
<LR_penalty> := categorical (options=['l1', 'l2'])
<LR_C> := continuous (min=0.1, max=10)
As you can see, this grammar describes the set of all possible instances of LR
by
describing how to call the constructor, and how to generate random values for its parameters.
Note
Formally, this is called a context-free grammar. Context-free grammars are used in computer science to describe formal languages,
such as programming languages, mathematical expressions, etc.
They work by describing a set of replacement rules that you can apply recursively to
construct a string of a specific language. In this case we are using grammars to describe the language
of all Python expressions that instantiate an LR.
You can read more in Wikipedia.
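As a rough illustration of replacement rules applied recursively, the sketch below defines a tiny context-free grammar for arithmetic-like strings and samples from it. The RULES table, the expand function, and the depth limit are all assumptions made for this example; they are unrelated to AutoGOAL's own grammar classes.

```python
import random

# Each non-terminal maps to a list of alternatives; each alternative is a
# sequence of symbols. Symbols that are keys of RULES get replaced recursively.
RULES = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["x"], ["y"], ["(", "<expr>", ")"]],
}


def expand(symbol, rng, depth=0, max_depth=6):
    """Recursively apply replacement rules until only terminals remain."""
    if symbol not in RULES:
        return symbol  # terminal: emitted as-is
    alternatives = RULES[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative to force termination.
        alternatives = [min(alternatives, key=len)]
    chosen = rng.choice(alternatives)
    return "".join(expand(s, rng, depth + 1, max_depth) for s in chosen)


rng = random.Random(0)
samples = [expand("<expr>", rng) for _ in range(20)]
for s in samples[:5]:
    print(s)
```

Every printed string belongs to the language described by the rules, in the same way that every sampled LR instance belongs to the language described by the grammar above.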
You can use this grammar to generate a bunch of random instances.
for _ in range(5):
    print(grammar.sample())
LR(C=4.015231900472649, penalty='l2')
LR(C=9.556786605505499, penalty='l2')
LR(C=4.05716261883461, penalty='l1')
LR(C=3.2786487445120858, penalty='l1')
LR(C=4.655510386502897, penalty='l2')
Now we can search for the best combination of constructor parameters by trying a bunch of different instances and seeing which one obtains the best score. AutoGOAL also has tools for automating this process.
from autogoal.search import RandomSearch
search = RandomSearch(grammar, evaluate, random_state=0) # Fixed seed
best, score = search.run(100)
print("Best:", best, "\nScore:", score)
The RandomSearch
will try 100 different random instances, and for each one
run the evaluate
method we defined earlier. It returns the best one and the corresponding score.
Best: LR(C=0.7043201482743121, penalty='l1')
Score: 0.8853333333333337
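The sample-evaluate-keep-the-best loop described above can be sketched in a few lines of plain Python. This is an illustrative assumption about what a random search does in general, not AutoGOAL's implementation; random_search, sample_c and toy_score are hypothetical names, and the toy objective simply has a known peak at C = 3.

```python
import random


def random_search(sample, evaluate, iterations, seed=None):
    """Draw `iterations` candidates and return the best one with its score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = sample(rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def sample_c(rng):
    # Hypothetical search space: a single float in [0.1, 10].
    return rng.uniform(0.1, 10)


def toy_score(c):
    # Hypothetical objective with its maximum at c = 3.
    return -(c - 3.0) ** 2


best, score = random_search(sample_c, toy_score, iterations=100, seed=0)
print("Best:", best, "Score:", score)
```

Because the seed fixes the sequence of candidates, running the search twice with the same seed returns the same result, which is the role random_state plays in the example above.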
So we can do a little bit better by carefully selecting the right parameters. However, maybe we can do even better.
Trying different algorithms
To continue this line of thought, maybe we could do better with a different classifier. We could try decision trees, support vector machines, naive Bayes, and many more. Here again AutoGOAL can come to our aid. Instead of writing a loop through all the possible classes ourselves, we can do the following.
First, we import everything we need.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
Now that we have all the classes we want to try, we have to tell AutoGOAL that there is something
to optimize.
We start by defining a space of possible parameters that we want to tune for each of these
classes. Like with LR
, we will wrap these classes in our own to provide the corresponding annotations.
class SVM(SVC):
    def __init__(
        self,
        kernel: CategoricalValue("rbf", "linear", "poly"),
        C: ContinuousValue(0.1, 10),
    ):
        super().__init__(C=C, kernel=kernel)

class DT(DecisionTreeClassifier):
    def __init__(self, criterion: CategoricalValue("gini", "entropy")):
        super().__init__(criterion=criterion)

class NB(GaussianNB):
    def __init__(self, var_smoothing: ContinuousValue(1e-10, 0.1)):
        super().__init__(var_smoothing=var_smoothing)
Next, we use AutoGOAL to construct a grammar for the union of the possible instances of each of these classes.
from autogoal.grammar import Union
from autogoal.grammar import generate_cfg
grammar = generate_cfg(Union("Classifier", LR, SVM, NB, DT))
Note
The method generate_cfg
works not only with annotated classes
but also with plain methods, or anything that has a __call__
and suitable annotations.
This grammar defines all possible ways to obtain a Classifier
, which is basically
by instantiating one of the classes we gave it with a suitable value for each parameter.
We can test it by generating a few of them.
print(grammar)
<Classifier> := <LR> | <SVM> | <NB> | <DT>
<LR> := LR (penalty=<LR_penalty>, C=<LR_C>)
<LR_penalty> := categorical (options=['l1', 'l2'])
<LR_C> := continuous (min=0.1, max=10)
<SVM> := SVM (kernel=<SVM_kernel>, C=<SVM_C>)
<SVM_kernel> := categorical (options=['rbf', 'linear', 'poly'])
<SVM_C> := continuous (min=0.1, max=10)
<NB> := NB (var_smoothing=<NB_var_smoothing>)
<NB_var_smoothing> := continuous (min=1e-10, max=0.1)
<DT> := DT (criterion=<DT_criterion>)
<DT_criterion> := categorical (options=['gini', 'entropy'])
Note
The constructor for Union
requires as first parameter a name
so that in the grammar
a suitable production can be defined. Think of it as the name of an abstract class that
groups all your classes, except that no actual type is ever created; the name is there only for organizational purposes.
for _ in range(5):
    print(grammar.sample())
NB(var_smoothing=0.04620465447733762)
DT(criterion='gini')
SVM(C=3.2914771222720116, kernel='rbf')
LR(C=7.809744923904822, penalty='l1')
DT(criterion='gini')
Now that we have a bunch of possible algorithms, let's see which one is best.
search = RandomSearch(grammar, evaluate, random_state=0)
best, score = search.run(100)
print("Best:", best, "\nScore:", score)
Best: NB(var_smoothing=0.08450775758264377)
Score: 0.8840000000000003
So it doesn't really seem that we can do much better, which is unsurprising given that we are only doing a random search (there are better search methods in AutoGOAL), and this is a toy problem which basically any algorithm can solve fairly well.
However, to continue with the example, now that we know how to optimize any given grammar, the interesting question is whether we can increase the complexity of our pipeline by adding more layers and steps to it, in order to solve more challenging problems.
Adding more steps
To illustrate how to build more complex pipelines, let's shift our focus to a somewhat more challenging problem: sentiment analysis. We will use the well-known movie reviews corpus as a testbed in the next few examples.
from autogoal.datasets import movie_reviews
To solve sentiment analysis we need to add a step before the actual classification in order to get
feature matrices from text. The simplest solution is to use a vectorizer from scikit-learn
.
There are two options to choose from.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
The CountVectorizer
class has many parameters that we might want to tune, but
in this example we are interested only in trying different n-gram combinations.
Hence, we will wrap CountVectorizer
in our own Count
class, and redefine its constructor
to receive an ngram
parameter. We annotate this parameter with DiscreteValue(1, 3)
to indicate that the possible values are integers in the interval [1, 3].
from autogoal.grammar import DiscreteValue
class Count(CountVectorizer):
    def __init__(self, ngram: DiscreteValue(1, 3)):
        super().__init__(ngram_range=(1, ngram))
        self.ngram = ngram
Note
The reason why we store ngram
in the __init__()
method is
for documentation purposes, so that when we call print()
we get to see the actual parameters that were selected.
This works automatically for parameters that are named exactly as sklearn
parameters, because their __repr__
takes care of it, but for parameters which we
introduce we need to store them in the instance so that __repr__
works.
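The following sketch imitates, in a heavily simplified way, a __repr__ that introspects the constructor signature and looks up one attribute per parameter name. It is an assumption-laden toy, not sklearn's actual code; ReprFromInit, ToyVectorizer and ToyCount are hypothetical names. It shows why a parameter that we introduce ourselves must also be stored on the instance.

```python
import inspect


class ReprFromInit:
    """Simplified imitation of a repr that introspects the constructor."""

    def __repr__(self):
        names = [
            n for n in inspect.signature(type(self).__init__).parameters
            if n != "self"
        ]
        # Each constructor parameter is looked up as an attribute of the same name.
        args = ", ".join(f"{n}={getattr(self, n)!r}" for n in sorted(names))
        return f"{type(self).__name__}({args})"


class ToyVectorizer(ReprFromInit):
    def __init__(self, ngram_range=(1, 1)):
        self.ngram_range = ngram_range


class ToyCount(ToyVectorizer):
    def __init__(self, ngram):
        super().__init__(ngram_range=(1, ngram))
        # Without this line, repr(ToyCount(2)) would raise AttributeError.
        self.ngram = ngram


print(ToyCount(2))  # ToyCount(ngram=2)
```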
Now we will do the same with the TfidfVectorizer
class, but this time we also want to
explore automatically whether enabling or disabling use_idf
is better.
We will use the BooleanValue
annotation in this case.
from autogoal.grammar import BooleanValue
class TfIdf(TfidfVectorizer):
    def __init__(self, ngram: DiscreteValue(1, 3), use_idf: BooleanValue()):
        super().__init__(ngram_range=(1, ngram), use_idf=use_idf)
        self.ngram = ngram
Besides vectorization, another common step in NLP pipelines is dimensionality reduction.
For dimensionality reduction, we want to either use singular value decomposition,
or nothing at all. The implementation of TruncatedSVD
is suitable here because it
provides a fast and scalable approximation to SVD when dealing with sparse matrices.
As before, we want to parameterize the target dimension, so we will use the DiscreteValue(50, 200)
annotation, i.e., if we reduce at all, we reduce to between 50
and 200
dimensions.
from sklearn.decomposition import TruncatedSVD
class SVD(TruncatedSVD):
    def __init__(self, n: DiscreteValue(50, 200)):
        super().__init__(n_components=n)
        self.n = n
To disable dimensionality reduction in some pipelines, it's not correct to simply pass a None
object. That would raise an exception. Instead, we make use of the
Null Object design pattern
and provide a "no-op" implementation that simply passes through the values.
class Noop:
    def fit_transform(self, X, y=None):
        return X

    def transform(self, X, y=None):
        return X

    def __repr__(self):
        return "Noop()"
Note
Technically, we could use "passthrough"
as an argument to the Pipeline
class that we will
use below and achieve the same result. However, this approach is more general and clean, since
it doesn't rely on the underlying API providing us with an implementation of the Null Object pattern.
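Stripped of any sklearn specifics, the Null Object pattern can be sketched as follows. The Scale step, this reduced Noop, and the run helper are hypothetical and exist only for this illustration; the point is that the code driving the steps never needs a special case for "no step here".

```python
class Scale:
    """Toy step that multiplies every value by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, values):
        return [v * self.factor for v in values]


class Noop:
    """Null Object: same interface as a real step, but changes nothing."""

    def transform(self, values):
        return values


def run(steps, values):
    # No `if step is None` check is needed: every step honours the interface.
    for step in steps:
        values = step.transform(values)
    return values


print(run([Scale(2), Noop(), Scale(3)], [1, 2]))  # [6, 12]
```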
Now that we have all of the necessary classes with their corresponding parameters
correctly annotated, it's time to put it all together into a pipeline. We will
inherit from sklearn
's own implementation of Pipeline
, because we want to fix
the actual steps that are going to be used.
Just as before, our initializer declares the parameters. In this case, we
want a vectorizer, a decomposer and a classifier. To tell autogoal
to try
different classes for the same parameter we use the Union
annotation.
Likewise, just as before, we have to call the base initializer, this time passing the
corresponding configuration for an
sklearn
pipeline.
from sklearn.pipeline import Pipeline as _Pipeline
class Pipeline(_Pipeline):
    def __init__(
        self,
        vectorizer: Union("Vectorizer", Count, TfIdf),
        decomposer: Union("Decomposer", Noop, SVD),
        classifier: Union("Classifier", LR, SVM, DT, NB),
    ):
        self.vectorizer = vectorizer
        self.decomposer = decomposer
        self.classifier = classifier
        super().__init__(
            [("vec", vectorizer), ("dec", decomposer), ("cls", classifier),]
        )
Once everything is in place, we can tell autogoal
to automatically infer a grammar
for all the possible combinations of parameters and classes that we can use.
The root of our grammar is the Pipeline
class we just defined. The method generate_cfg
does exactly that, taking a class and building a context free grammar to construct
that class, based on the parameters' annotations and recursively building the corresponding
rules for all classes down to basic parameter types.
grammar = generate_cfg(Pipeline)
Notice how the grammar specifies all the possible ways to build a Pipeline
,
both considering the different implementations we have for vectorizers, decomposers and classifiers;
as well as their corresponding parameters. Our grammar is fairly simple because we only have
two levels of recursion, Pipeline and its parameters; but this same process can be applied to any
hierarchy of any complexity, including circular references.
print(grammar)
<Pipeline> := Pipeline (vectorizer=<Vectorizer>, decomposer=<Decomposer>, classifier=<Classifier>)
<Vectorizer> := <Count> | <TfIdf>
<Count> := Count (ngram=<Count_ngram>)
<Count_ngram> := discrete (min=1, max=3)
<TfIdf> := TfIdf (ngram=<TfIdf_ngram>, use_idf=<TfIdf_use_idf>)
<TfIdf_ngram> := discrete (min=1, max=3)
<TfIdf_use_idf> := boolean ()
<Decomposer> := <Noop> | <SVD>
<Noop> := Noop ()
<SVD> := SVD (n=<SVD_n>)
<SVD_n> := discrete (min=50, max=200)
<Classifier> := <LR> | <SVM> | <DT> | <NB>
<LR> := LR (penalty=<LR_penalty>, C=<LR_C>)
<LR_penalty> := categorical (options=['l1', 'l2'])
<LR_C> := continuous (min=0.1, max=10)
<SVM> := SVM (kernel=<SVM_kernel>, C=<SVM_C>)
<SVM_kernel> := categorical (options=['rbf', 'linear', 'poly'])
<SVM_C> := continuous (min=0.1, max=10)
<DT> := DT (criterion=<DT_criterion>)
<DT_criterion> := categorical (options=['gini', 'entropy'])
<NB> := NB (var_smoothing=<NB_var_smoothing>)
<NB_var_smoothing> := continuous (min=1e-10, max=0.1)
Now we can start to see the power of the class-based API. With just a few annotations in the same classes that we have to write anyway, we automatically obtain a computational representation (a grammar) that knows how to build infinitely many of these instances. Furthermore, this works with any level of complexity, whether our classes receive simple arguments (such as integers, floats, strings) or instances of other classes, and so on.
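The recursive nature of this construction can be imitated with a small stand-alone sketch: a builder that instantiates a class by sampling a value for each annotated parameter, descending into other classes when an annotation asks for one. Discrete, OneOf, build, Skip, Reduce and Flow are hypothetical names invented for this example, and the approach is an assumption about the general technique rather than a description of AutoGOAL's internals.

```python
import inspect
import random


class Discrete:
    """Hypothetical annotation: an integer in [low, high]."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def sample(self, rng):
        return rng.randint(self.low, self.high)


class OneOf:
    """Hypothetical annotation: an instance of one of several classes."""

    def __init__(self, *options):
        self.options = options

    def sample(self, rng):
        # Recursion happens here: building an option may sample further parameters.
        return build(rng.choice(self.options), rng)


def build(cls, rng):
    """Instantiate cls, sampling a value for each annotated constructor parameter."""
    params = inspect.signature(cls.__init__).parameters
    kwargs = {
        name: p.annotation.sample(rng)
        for name, p in params.items()
        if hasattr(p.annotation, "sample")
    }
    return cls(**kwargs)


class Skip:
    pass


class Reduce:
    def __init__(self, n: Discrete(50, 200)):
        self.n = n


class Flow:
    def __init__(self, decomposer: OneOf(Skip, Reduce)):
        self.decomposer = decomposer


rng = random.Random(0)
flows = [build(Flow, rng) for _ in range(20)]
print([type(f.decomposer).__name__ for f in flows[:5]])
```

Adding another level, for example a class whose parameter is itself a OneOf over Flow variants, requires no change to build, which is the property the paragraph above refers to.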
Let's take a look at how different pipelines can be generated with this grammar by sampling 10 random pipelines.
for _ in range(10):
    print(grammar.sample())
You should see something like this, but your exact pipelines will be different due to random sampling.
Pipeline(classifier=SVM(C=4.09762837283166, kernel='rbf'), decomposer=Noop(),
vectorizer=Count(ngram=1))
Pipeline(classifier=DT(criterion='entropy'), decomposer=Noop(),
vectorizer=Count(ngram=2))
Pipeline(classifier=LR(C=5.309978916527087, penalty='l2'), decomposer=Noop(),
vectorizer=Count(ngram=3))
Pipeline(classifier=LR(C=9.776994352626533, penalty='l1'), decomposer=Noop(),
vectorizer=TfIdf(ngram=2, use_idf=True))
Pipeline(classifier=SVM(C=5.973033047496386, kernel='rbf'),
decomposer=SVD(n=197), vectorizer=Count(ngram=3))
Pipeline(classifier=NB(var_smoothing=0.07941925220053651),
decomposer=SVD(n=183), vectorizer=TfIdf(ngram=3, use_idf=False))
Pipeline(classifier=DT(criterion='entropy'), decomposer=SVD(n=144),
vectorizer=TfIdf(ngram=1, use_idf=True))
Pipeline(classifier=SVM(C=6.052775609636756, kernel='poly'),
decomposer=SVD(n=160), vectorizer=TfIdf(ngram=1, use_idf=True))
Pipeline(classifier=DT(criterion='entropy'), decomposer=Noop(),
vectorizer=Count(ngram=1))
Pipeline(classifier=DT(criterion='entropy'), decomposer=Noop(),
vectorizer=TfIdf(ngram=3, use_idf=True))
Finding the best pipeline
To continue with the example, we will now search for the best pipeline.
We will evaluate our pipelines on the movie_reviews
corpus. For that purpose
we need a fitness function, which is a simple callable that takes a pipeline and outputs
a score. Fortunately, the movie_reviews.make_fn
function does this for us, taking
care of train/test splitting, fitting a pipeline on the training set and computing
the accuracy on the test set.
fitness_fn = movie_reviews.make_fn(examples=100)
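For readers who want to write their own fitness function, the sketch below shows one plausible shape for such a factory: it fixes a train/test split once and returns a callable that fits a model and reports test accuracy. It is not the source of movie_reviews.make_fn, whose internals are not shown here; make_fitness_fn, MajorityClass and the tiny dataset are hypothetical.

```python
import random


def make_fitness_fn(examples, labels, test_fraction=0.25, seed=0):
    """Return a fitness function: model -> accuracy on a held-out split."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]

    def fitness(model):
        model.fit([examples[i] for i in train], [labels[i] for i in train])
        predictions = model.predict([examples[i] for i in test])
        hits = sum(p == labels[i] for p, i in zip(predictions, test))
        return hits / len(test)

    return fitness


class MajorityClass:
    """Toy model that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


texts = ["good", "great", "fine", "nice", "bad", "awful", "superb", "solid"]
labels = [1, 1, 1, 1, 0, 0, 1, 1]
toy_fitness_fn = make_fitness_fn(texts, labels)
toy_score = toy_fitness_fn(MajorityClass())
print(toy_score)
```

Any object with fit and predict methods can be passed to the returned callable, so the same function could score every candidate that a search strategy produces.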
The RandomSearch
strategy simply calls grammar.sample()
a bunch of times
and stores the best performing pipeline. It has no intelligence whatsoever,
but it serves as a good baseline implementation.
We will run it for a total of 1000
fitness evaluations, or equivalently, a total
of 1000
different random pipelines.
random_search = RandomSearch(grammar, fitness_fn, random_state=0)
best, score = random_search.run(1000)
Note
For reproducibility purposes we can pass a fixed random seed in random_state
.
Final remarks
We only used scikit-learn
here for illustrative purposes, but you can apply this strategy to any problem
whose solution consists of exploring a large space of complex class instances interrelated with each other.
Also, in this example we have manually written wrappers for scikit-learn
classes to provide the necessary
annotations. However, specifically for scikit-learn
, we already provide a bunch of wrappers with suitable
annotations in autogoal.contrib.sklearn
.
We also only use RandomSearch
in this example because the focus is on defining the pipelines.
However, the autogoal.search
namespace contains other search strategies that perform
much better than plain random sampling.