# Solving the MEDDOCAN challenge

This script runs an `AutoML` instance on the MEDDOCAN 2019 challenge.
The full source code can be found here.
Dataset | URL |
---|---|
MEDDOCAN 2019 | https://github.com/PlanTL-SANIDAD/SPACCC_MEDDOCAN |
## Experimentation parameters
This experiment was run with the following parameters:
Parameter | Value |
---|---|
Total epochs | 1 |
Maximum iterations | 10000 |
Timeout per pipeline | 30 min |
Global timeout | - |
Max RAM per pipeline | 20 GB |
Population size | 50 |
Selection (k-best) | 10 |
Early stop | - |
The experiments were run on the following hardware configurations (machines were assigned interchangeably, depending on available resources):
Config | CPU | Cache | Memory | HDD |
---|---|---|---|---|
A | 12-core Intel Xeon Gold 6126 | 19712 KB | 191927.2 MB | 999.7 GB |
B | 6-core Intel Xeon E5-1650 v3 | 15360 KB | 32045.5 MB | 2500.5 GB |
C | Quad-core Intel Core i7-2600 | 8192 KB | 15917.1 MB | 1480.3 GB |
Note: The hardware configuration details were extracted with `inxi -CmD` and summarized.
## Relevant imports
Most of this example follows the same logic as the UCI example. First, the necessary imports:
```python
from autogoal.ml import AutoML
from autogoal.datasets import meddocan
from autogoal.search import (
    RichLogger,
    PESearch,
)
from autogoal.kb import *
```
## Parsing arguments
Next, we parse the command-line arguments that configure the experiment.
The default values are the ones used for the experiments reported in the paper.
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=10000)
parser.add_argument("--timeout", type=int, default=1800)
parser.add_argument("--memory", type=int, default=20)
parser.add_argument("--popsize", type=int, default=50)
parser.add_argument("--selection", type=int, default=10)
parser.add_argument("--global-timeout", type=int, default=None)
parser.add_argument("--examples", type=int, default=None)
parser.add_argument("--token", default=None)
parser.add_argument("--channel", default=None)

args = parser.parse_args()
print(args)
```
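For quick smoke tests it can be convenient to try a much smaller configuration without editing the script. As a minimal sketch (the flag values below are illustrative only), `parse_args` also accepts an explicit list of arguments:

```python
# Illustrative only: pass an explicit argument list to try a small run,
# e.g. fewer iterations, a shorter per-pipeline timeout, and fewer examples.
args = parser.parse_args(["--iterations", "100", "--timeout", "300", "--examples", "200"])
print(args)
```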
## Experimentation
Instantiate the classifier. Note that the input and output types here are defined to match the problem statement, i.e., entity recognition.
```python
from autogoal.contrib import find_classes

classifier = AutoML(
    search_algorithm=PESearch,
    input=(Seq[Seq[Word]], Supervised[Seq[Seq[Label]]]),
    output=Seq[Seq[Label]],
    registry=find_classes(exclude="Keras|Bert"),
    search_iterations=args.iterations,
    score_metric=meddocan.F1_beta,
    cross_validation_steps=1,
    pop_size=args.popsize,
    search_timeout=args.global_timeout,
    evaluation_timeout=args.timeout,
    memory_limit=args.memory * 1024 ** 3,
)
```
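The `memory_limit` argument is expressed in bytes, hence the `args.memory * 1024 ** 3` conversion from gigabytes. To make the declared types concrete, `Seq[Seq[Word]]` is a collection of tokenized sentences and `Supervised[Seq[Seq[Label]]]` holds one tag per token. A minimal sketch of that layout, with made-up tokens and BIO-style tags (not taken from the corpus):

```python
# Illustrative data layout only (not real corpus content):
# X is a list of sentences, each a list of tokens;
# y mirrors X with exactly one label per token.
X_sample = [["Paciente", ":", "Juan", "Pérez", "."]]
y_sample = [["O", "O", "B-NOMBRE_SUJETO_ASISTENCIA", "I-NOMBRE_SUJETO_ASISTENCIA", "O"]]
```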
Basic logging configuration.
```python
loggers = [RichLogger()]

if args.token:
    from autogoal.contrib.telegram import TelegramLogger

    telegram = TelegramLogger(token=args.token, name="MEDDOCAN", channel=args.channel)
    loggers.append(telegram)
```
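The Telegram credentials could also be taken from the environment instead of the command line. A small sketch; the variable names `TELEGRAM_TOKEN` and `TELEGRAM_CHANNEL` are assumptions, not part of the original script:

```python
# Hypothetical alternative: fall back to environment variables for the
# Telegram credentials (the variable names here are illustrative).
import os

token = args.token or os.environ.get("TELEGRAM_TOKEN")
channel = args.channel or os.environ.get("TELEGRAM_CHANNEL")
```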
Finally, we load the MEDDOCAN dataset, run the `AutoML` instance, and print the results.
```python
X_train, y_train, X_test, y_test = meddocan.load(max_examples=args.examples)

classifier.fit(X_train, y_train, logger=loggers)
score = classifier.score(X_test, y_test)

print(score)
```
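Beyond the aggregate score, it can be useful to inspect a few raw predictions from the best pipeline found. A short sketch using the classifier's `predict` method; the slicing is only there to keep the output small:

```python
# Optional: print the predicted tags for the first two test sentences.
y_pred = classifier.predict(X_test)
for tokens, tags in zip(X_test[:2], y_pred[:2]):
    print(list(zip(tokens, tags)))
```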