Solving the HAHA challenge

This script runs an instance of AutoML in the HAHA 2019 challenge. The full source code can be found here.

The dataset used is:

| Dataset | URL |
|---|---|
| HAHA 2019 | https://www.fing.edu.uy/inco/grupos/pln/haha/index.html#data |

Experimentation parameters

This experiment was run with the following parameters:

| Parameter | Value |
|---|---|
| Total epochs | 1 |
| Maximum iterations | 10000 |
| Timeout per pipeline | 30 min |
| Global timeout | - |
| Max RAM per pipeline | 20 GB |
| Population size | 50 |
| Selection (k-best) | 10 |
| Early stop | - |

The experiments were run on the following hardware configurations (allocated interchangeably according to available resources):

| Config | CPU | Cache | Memory | HDD |
|---|---|---|---|---|
| A | 12-core Intel Xeon Gold 6126 | 19712 KB | 191927.2 MB | 999.7 GB |
| B | 6-core Intel Xeon E5-1650 v3 | 15360 KB | 32045.5 MB | 2500.5 GB |
| C | Quad-core Intel Core i7-2600 | 8192 KB | 15917.1 MB | 1480.3 GB |

Note

The hardware configuration details were extracted with `inxi -CmD` and summarized.

Relevant imports

Most of this example follows the same logic as the UCI example. First, the necessary imports:

from autogoal.ml import AutoML
from autogoal.datasets import haha
from autogoal.search import (
    PESearch,
    RichLogger,
)
from autogoal.kb import Seq, Sentence, VectorCategorical, Supervised
from autogoal.contrib import find_classes
from sklearn.metrics import f1_score

Next, we parse the command line arguments to configure the experiment.

Parsing arguments

The default values are the ones used for the experimentation reported in the paper.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=10000)
parser.add_argument("--timeout", type=int, default=60)
parser.add_argument("--memory", type=int, default=2)
parser.add_argument("--popsize", type=int, default=50)
parser.add_argument("--selection", type=int, default=10)
parser.add_argument("--global-timeout", type=int, default=None)
parser.add_argument("--examples", type=int, default=None)
parser.add_argument("--token", default=None)
parser.add_argument("--channel", default=None)

args = parser.parse_args()

print(args)
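As a quick sanity check, the parser above can be exercised with an explicit argument list. The following is a minimal standalone sketch that reproduces just the relevant options, independent of AutoGOAL:

```python
import argparse

# Standalone copy of the relevant options from the parser above
parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=10000)
parser.add_argument("--timeout", type=int, default=60)
parser.add_argument("--memory", type=int, default=2)
parser.add_argument("--global-timeout", type=int, default=None)

# Override the per-pipeline timeout and memory limit from the command line
args = parser.parse_args(["--timeout", "1800", "--memory", "20"])
print(args.timeout, args.memory)  # 1800 20
print(args.iterations)            # 10000 (default kept)
```

Note that `--global-timeout` is exposed to the rest of the script as `args.global_timeout`, since `argparse` converts dashes in option names to underscores.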

The next line will print all the algorithms that AutoGOAL found in the contrib library, i.e., anything that could be potentially used to solve an AutoML problem.

for cls in find_classes():
    print("Using: %s" % cls.__name__)

Experimentation

Instantiate the classifier. Note that the input and output types here are defined to match the problem statement, i.e., text classification.

classifier = AutoML(
    search_algorithm=PESearch,
    input=(Seq[Sentence], Supervised[VectorCategorical]),
    output=VectorCategorical,
    search_iterations=args.iterations,
    score_metric=f1_score,
    errors="warn",
    pop_size=args.popsize,
    search_timeout=args.global_timeout,
    evaluation_timeout=args.timeout,
    memory_limit=args.memory * 1024 ** 3,
)
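The `memory_limit` argument is given in bytes, so the expression `args.memory * 1024 ** 3` converts the gigabyte count from the command line. For instance, the 20 GB limit from the experiment table works out as:

```python
memory_gb = 20                        # e.g. the 20 GB limit used in the experiments
memory_limit = memory_gb * 1024 ** 3  # convert gibibytes to bytes
print(memory_limit)                   # 21474836480
```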

loggers = [RichLogger()]

if args.token:
    from autogoal.contrib.telegram import TelegramLogger

    telegram = TelegramLogger(token=args.token, name="HAHA", channel=args.channel)
    loggers.append(telegram)

Finally, we load the HAHA dataset, run the AutoML instance, and print the results.

X_train, y_train, X_test, y_test = haha.load(max_examples=args.examples)

classifier.fit(X_train, y_train, logger=loggers)
score = classifier.score(X_test, y_test)

print(score)
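The reported score is computed with `sklearn.metrics.f1_score`, which by default returns the F1 of the positive class in a binary problem. As a reference for what that number means, here is a pure-Python sketch of the same binary F1 computation:

```python
def binary_f1(y_true, y_pred):
    # Counts over the positive class (label 1), matching the default
    # binary averaging of sklearn.metrics.f1_score.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One false negative among three positives: precision 1.0, recall 2/3
print(binary_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # ≈ 0.8
```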