Import and export data - Weights & Biases Documentation

Use the W&B Public APIs to move experiment data into and out of W&B. This guide shows you how to import runs from MLFlow, export run data for custom analysis, and update data on runs that you have already logged to W&B.

This feature requires python>=3.8.

Import data from MLFlow

Migrate your existing MLFlow tracking data to W&B so that you can continue analyzing past experiments alongside new ones. W&B supports importing data from MLFlow, including experiments, runs, artifacts, metrics, and other metadata.

Install the importer dependencies, which provide the MLFlow integration:

# note: this requires py38+
pip install wandb[importers]

wandb login

Import all runs from an existing MLFlow server:

from wandb.apis.importers.mlflow import MlflowImporter

importer = MlflowImporter(mlflow_tracking_uri="...")

runs = importer.collect_runs()
importer.import_runs(runs)

By default, importer.collect_runs() collects all runs from the MLFlow server. To upload a specific subset instead, construct your own runs iterable and pass it to the importer:

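from typing import List

import mlflow
from wandb.apis.importers.mlflow import MlflowRun

# Use the same tracking URI that you passed to MlflowImporter.
mlflow_tracking_uri = "..."
client = mlflow.tracking.MlflowClient(mlflow_tracking_uri)

# Wrap each MLFlow run that you want to import.
runs: List[MlflowRun] = []
for run in client.search_runs(...):
    runs.append(MlflowRun(run, client))

importer.import_runs(runs)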

If you import from Databricks MLFlow, you might need to configure the Databricks CLI first. Set mlflow_tracking_uri="databricks" in the previous step.

To skip importing artifacts, pass artifacts=False:

importer.import_runs(runs, artifacts=False)

To import to a specific W&B entity and project, pass a Namespace:

from wandb.apis.importers import Namespace

importer.import_runs(runs, namespace=Namespace(entity, project))

Export data

Use the Public API to export or update data that you have saved to W&B. Before using this API, log data from your script. For more details, see the Quickstart. Common use cases for the Public API include:

Export data: Download a dataframe for custom analysis in a Jupyter Notebook. After you have explored the data, you can sync your findings by creating a new analysis run and logging results, for example: wandb.init(job_type="analysis").
Update existing runs: Update the data logged in association with a W&B run. For example, you might want to update the config of a set of runs to include additional information, like the architecture or a hyperparameter that wasn’t originally logged.

See the Generated Reference Docs for details on available functions.

Create an API key

The Public API requires an API key to authenticate requests from your machine to W&B. To create an API key, select the Personal API key or Service Account API key tab for details.

Personal API key
Service account API key

To create a personal API key owned by your user ID:

Log in to W&B, click your user profile icon, then click User Settings.
Click Create new API key.
Provide a descriptive name for your API key.
Click Create.
Copy the displayed API key immediately and store it securely.

The full API key is only shown once at creation time. After you close the dialog, you cannot view the full API key again. Only the key ID (first part of the key) is visible in your settings. If you lose the full API key, you must create a new API key.

For secure storage options, see Store API keys securely.

Store and handle API keys securely

API keys provide access to your W&B account and should be protected like passwords. Follow these best practices:

Recommended storage methods

Secrets manager: Use a dedicated secrets management system such as AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, or Google Secret Manager.
Password manager: Use a reputable password manager application.
OS-level keychains: Store keys in macOS Keychain, Windows Credential Manager, or Linux secret service. Not suggested for production.

What to avoid

Never commit API keys to version control systems such as Git.
Do not store API keys in plain text configuration files.
Do not pass API keys on the command line, because they will be visible in the output of OS commands like ps.
Avoid sharing API keys through email, chat, or other unencrypted channels.
Do not hard-code API keys in your source code.

If an API key is exposed, delete the API key from your W&B account immediately and contact support or your AISE.

Environment variables

When using API keys in your code, pass them through environment variables:

export WANDB_API_KEY="your-api-key-here"

This approach keeps keys out of your source code and makes it easier to rotate them when needed.

Avoid setting the environment variable in line with the command, because it will be visible in the output of OS commands like ps:

# Avoid this pattern, which can expose the API key in process managers
WANDB_API_KEY="your-api-key-here" ./my-script.sh
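Once the variable is exported, a script can read it from the process environment. The following is a minimal sketch of a fail-fast check, assuming you want an explicit error before any request is made. The helper name `require_api_key` is hypothetical and is not part of the wandb SDK, which reads WANDB_API_KEY on its own.

```python
import os


def require_api_key(env=os.environ):
    """Return the W&B API key from the environment, or raise if it is unset.

    Hypothetical helper for illustration; it is not part of the wandb SDK.
    """
    key = env.get("WANDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WANDB_API_KEY is not set; export it in your shell before running."
        )
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key immediately, and the key itself never appears in the source file.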

SDK version compatibility

New API keys are longer than legacy keys. When authenticating with older versions of the wandb or weave SDKs, you may encounter an API key length error. Solution: Update to a newer SDK version:

wandb SDK v0.22.3+
```
pip install --upgrade "wandb>=0.22.3"
```
weave SDK v0.52.17+
```
pip install --upgrade "weave>=0.52.17"
```

If you cannot upgrade the SDK immediately, set the API key using the WANDB_API_KEY environment variable as a workaround.

Find the run path

Most Public API calls identify a run by its run path, which has the form <entity>/<project>/<run_id>. To find the run path, open a run page in the app UI and click the Overview tab.
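Because the run path is a plain string with three slash-separated parts, it can help to validate it before passing it to the API. The following sketch is one way to do that. The `RunPath` type and `parse_run_path` function are hypothetical names introduced for illustration and are not part of the wandb SDK.

```python
from typing import NamedTuple


class RunPath(NamedTuple):
    """The three components of a run path (hypothetical helper type)."""

    entity: str
    project: str
    run_id: str

    def __str__(self):
        # Rebuild the <entity>/<project>/<run_id> form used by the Public API.
        return f"{self.entity}/{self.project}/{self.run_id}"


def parse_run_path(path):
    """Split '<entity>/<project>/<run_id>' and reject malformed input."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected '<entity>/<project>/<run_id>', got {path!r}"
        )
    return RunPath(*parts)
```

For example, `str(parse_run_path("my-team/my-project/abc123"))` returns the same string, ready to pass to `api.run()`, while a path with a missing component raises a `ValueError`.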

Export run data

Download data from a finished or active run so that you can analyze it outside of W&B. Common usage includes downloading a dataframe for custom analysis in a Jupyter notebook, or using custom logic in an automated environment.

import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")

The most-used attributes of a run object are:

Attribute	Meaning
`run.config`	A dictionary of the run’s configuration information, such as the hyperparameters for a training run or the preprocessing methods for a run that creates a dataset Artifact. Think of these as the run’s inputs.
`run.history()`	A list of dictionaries meant to store values that change while the model is training such as loss. The command `run.log()` appends to this object.
`run.summary`	A dictionary of information that summarizes the run’s results. This can be scalars like accuracy and loss, or large files. By default, `run.log()` sets the summary to the final value of a logged time series. The contents of the summary can also be set directly. Think of the summary as the run’s outputs.

You can also modify or update the data of past runs. By default, a single instance of an api object caches all network requests. If your use case requires real-time information in a running script, call api.flush() to get updated values.
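The caching behavior described above matters mostly when you poll a run that is still logging. The following sketch shows one way to read a fresh summary value. The function name `read_live_summary` is hypothetical, and `api` is assumed to be a `wandb.Api()` instance.

```python
def read_live_summary(api, run_path, key):
    """Return the current summary value for `key` on a run that may be active.

    Hypothetical helper; `api` is assumed to be a wandb.Api() instance and
    `run_path` has the form "<entity>/<project>/<run_id>".
    """
    # Clear cached responses so that the next request fetches current data.
    api.flush()
    run = api.run(run_path)
    # Returns None if the key has not been logged yet.
    return run.summary.get(key)
```

A monitoring script could call this in a loop with a sleep between iterations, for example `read_live_summary(wandb.Api(), "<entity>/<project>/<run_id>", "loss")`.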

Run attributes

The following code snippet shows how to create a run, log some data, and then access the run’s attributes:

import wandb
import random

with wandb.init(project="public-api-example") as run:
    n_epochs = 5
    config = {"n_epochs": n_epochs}
    run.config.update(config)
    for n in range(run.config.get("n_epochs")):
        run.log(
            {"val": random.randint(0, 1000), "loss": (random.randint(0, 1000) / 1000.00)}
        )

The following sections show the outputs for the run object attributes accessed in the previous example.

`run.config`

{"n_epochs": 5}

`run.summary`

{
    "_step": 4,
    "_timestamp": 1644345412,
    "_wandb": {"runtime": 3},
    "loss": 0.041,
    "val": 525,
}

Sample size

By default, run.history() samples the logged metrics down to a fixed number of rows (500 unless you change it with the samples argument). To export all of the data on a large run, use the run.scan_history() method. For more details, see the API Reference.
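As a rough illustration of the two access paths, the sketch below wraps them in a single function. The name `history_rows` is hypothetical, and `run` is assumed to be a run object returned by `wandb.Api().run(...)`.

```python
def history_rows(run, full=False, samples=500):
    """Return history rows as a list of dictionaries.

    Hypothetical helper; `run` is assumed to come from wandb.Api().run(...).
    """
    if full:
        # scan_history() iterates over every logged row.
        return list(run.scan_history())
    # history() returns sampled rows; pandas=False requests plain
    # dictionaries instead of a DataFrame.
    return run.history(samples=samples, pandas=False)
```

Use the sampled path for quick plots and the full path when every logged step matters, keeping in mind that a full scan of a long run takes longer.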

Query multiple runs

Use the following examples to fetch and filter multiple runs at once, which is useful when you want to compare runs in a project or aggregate results across an experiment.

DataFrame and CSVs
MongoDB Style

This example script finds a project and outputs a CSV of runs with name, configs, and summary stats. Replace <entity> and <project> with your W&B entity and the name of your project, respectively.

import pandas as pd
import wandb

api = wandb.Api()
entity, project = "<entity>", "<project>"
runs = api.runs(entity + "/" + project)

summary_list, config_list, name_list = [], [], []
for run in runs:
    # .summary contains output keys/values for
    # metrics such as accuracy.
    #  We call ._json_dict to omit large files
    summary_list.append(run.summary._json_dict)

    # .config contains the hyperparameters.
    #  We remove special values that start with _.
    config_list.append({k: v for k, v in run.config.items() if not k.startswith("_")})

    # .name is the human-readable name of the run.
    name_list.append(run.name)

runs_df = pd.DataFrame(
    {"summary": summary_list, "config": config_list, "name": name_list}
)

runs_df.to_csv("project.csv")

The W&B API also lets you query across runs in a project with api.runs(). A common use case is exporting runs data for custom analysis. The query interface is the same as the one MongoDB uses.

runs = api.runs(
    "username/project",
    {"$or": [{"config.experiment_name": "foo"}, {"config.experiment_name": "bar"}]},
)
print(f"Found {len(runs)} runs")

Calling api.runs returns a Runs object that is iterable and acts like a list. By default, the object loads 50 runs at a time in sequence as required. To change the number loaded per page, use the per_page keyword argument. api.runs also accepts an order keyword argument. The default order is -created_at. To order results in ascending order, specify +created_at. You can also sort by config or summary values. For example, summary.val_acc or config.experiment_name.
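The filter, order, and per_page options can be combined in one call. The following sketch builds the same kind of $or filter as the earlier query from a list of values. The function name `runs_matching_any` is hypothetical, and `api` is assumed to be a `wandb.Api()` instance.

```python
def runs_matching_any(api, project_path, config_key, values,
                      order="-created_at", per_page=50):
    """Query runs whose config value for `config_key` equals any of `values`.

    Hypothetical helper; `api` is assumed to be a wandb.Api() instance and
    `project_path` has the form "<entity>/<project>".
    """
    # One {"config.<key>": value} clause per accepted value, joined with $or.
    filters = {"$or": [{f"config.{config_key}": value} for value in values]}
    return api.runs(project_path, filters=filters, order=order, per_page=per_page)
```

For example, `runs_matching_any(api, "<entity>/<project>", "experiment_name", ["foo", "bar"], order="+created_at")` returns the matching runs from oldest to newest.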

Error handling

If errors occur while communicating with W&B servers, W&B raises a wandb.CommError. To introspect the original exception, use the exc attribute.
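A small sketch of how the exc attribute might be used when reporting a failure follows. The function name `describe_comm_error` is hypothetical. It accepts any exception object and falls back gracefully when no original exception is attached.

```python
def describe_comm_error(err):
    """Build a log message that includes the original exception, if any.

    Hypothetical helper; intended for use in an `except wandb.CommError`
    block, where `err.exc` holds the underlying exception.
    """
    original = getattr(err, "exc", None)
    if original is None:
        return f"W&B request failed: {err}"
    return (
        f"W&B request failed: {err} "
        f"(caused by {type(original).__name__}: {original})"
    )


# Typical usage (requires the wandb package and network access):
#
#     try:
#         run = wandb.Api().run("<entity>/<project>/<run_id>")
#     except wandb.CommError as err:
#         print(describe_comm_error(err))
```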

Get the latest git commit through the API

To see the latest git commit in the UI, click a run and then click the Overview tab on the run page. The commit is also in the file wandb-metadata.json. Using the public API, get the git hash with run.commit.

Get a run’s name and ID during a run

After calling wandb.init(), you can access the random run ID or the human-readable run name from your script as follows:

Unique run ID (8-character hash): run.id
Random run name (human readable): run.name

To set useful identifiers for your runs, follow these recommendations:

Run ID: Leave it as the generated hash. The run ID must be unique across runs in your project.
Run name: Use something short, readable, and preferably unique so that you can tell the difference between lines on your charts.
Run notes: A good place for a short description of what you’re doing in your run. Set this with wandb.init(notes="your notes here").
Run tags: Track things dynamically in run tags, and use filters in the UI to filter your table down to only the runs you care about. Set tags from your script and then edit them in the UI, both in the runs table and the Overview tab of the run page. For details, see Tag runs.

Public API examples

The following sections show common patterns for working with the Public API, including reading and filtering runs, updating run data, and downloading files.

Export data to visualize in matplotlib or seaborn

For common export patterns, see the API examples. To download a CSV from your browser, click the download button on a custom plot or on the expanded runs table.

Read metrics from a run

This example outputs timestamp and accuracy saved with run.log({"accuracy": acc}) for a run saved to "<entity>/<project>/<run_id>".

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
if run.state == "finished":
    for i, row in run.history().iterrows():
        print(row["_timestamp"], row["accuracy"])

Filter runs

To filter runs, use the MongoDB Query Language.

Date

runs = api.runs(
    "<entity>/<project>",
    {"$and": [{"created_at": {"$lt": "YYYY-MM-DDT##", "$gt": "YYYY-MM-DDT##"}}]},
)
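Rather than typing the timestamps by hand, you can build the filter from datetime objects. This sketch assumes that ## in the pattern above stands for the two-digit hour. The function name `created_between` is hypothetical.

```python
from datetime import datetime


def created_between(start, end):
    """Return a filter for runs created after `start` and before `end`.

    Hypothetical helper; assumes the YYYY-MM-DDT## pattern, where ## is
    taken to be the two-digit hour.
    """
    fmt = "%Y-%m-%dT%H"
    return {
        "$and": [
            {"created_at": {"$lt": end.strftime(fmt), "$gt": start.strftime(fmt)}}
        ]
    }


# Example: runs created in January 2024.
january = created_between(datetime(2024, 1, 1), datetime(2024, 2, 1))
```

The resulting dictionary can be passed as the second argument to `api.runs("<entity>/<project>", january)`.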

Read specific metrics from a run

To retrieve specific metrics from a run, use the keys argument. The default number of samples when using run.history() is 500. Logged steps that don’t include a specific metric appear in the output dataframe as NaN. The keys argument causes the API to sample steps that include the listed metric keys more frequently.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
if run.state == "finished":
    for i, row in run.history(keys=["accuracy"]).iterrows():
        print(row["_timestamp"], row["accuracy"])

Compare two runs

This example outputs the config parameters that differ between run1 and run2.

import pandas as pd
import wandb

api = wandb.Api()

# replace with your <entity>, <project>, and <run_id>
run1 = api.run("<entity>/<project>/<run_id>")
run2 = api.run("<entity>/<project>/<run_id>")


df = pd.DataFrame([run1.config, run2.config]).transpose()

df.columns = [run1.name, run2.name]
print(df[df[run1.name] != df[run2.name]])

Outputs:

              c_10_sgd_0.025_0.01_long_switch base_adam_4_conv_2fc
batch_size                                 32                   16
n_conv_layers                               5                    4
optimizer                             rmsprop                 adam
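If you prefer not to depend on pandas for this comparison, the same idea can be sketched with plain dictionaries. The function name `diff_configs` is hypothetical. It takes two config dictionaries, such as `run1.config` and `run2.config`, and reports a key that is absent from one side as None.

```python
def diff_configs(config_a, config_b):
    """Return {key: (value_a, value_b)} for keys whose values differ.

    Hypothetical helper; a key missing from one config is reported as None
    on that side.
    """
    missing = object()  # sentinel that cannot equal a real config value
    diff = {}
    for key in sorted(set(config_a) | set(config_b)):
        a = config_a.get(key, missing)
        b = config_b.get(key, missing)
        if a != b:
            diff[key] = (
                None if a is missing else a,
                None if b is missing else b,
            )
    return diff
```

For example, comparing `{"batch_size": 32, "lr": 0.1}` with `{"batch_size": 16, "lr": 0.1, "optimizer": "adam"}` reports `batch_size` and `optimizer` and omits `lr`.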

Update metrics for a run, after the run has finished

This example sets the accuracy of a previous run to 0.9. It also modifies the accuracy histogram of a previous run to be the histogram of numpy_array.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
run.summary["accuracy"] = 0.9
# numpy_array is a NumPy array of values that you have already computed.
run.summary["accuracy_histogram"] = wandb.Histogram(numpy_array)
run.summary.update()

Rename a metric in a completed run

This example renames a summary column in your tables.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
run.summary["new_name"] = run.summary["old_name"]
del run.summary["old_name"]
run.summary.update()

Renaming a column only applies to tables. Charts still refer to metrics by their original names.

Update config for an existing run

This example updates one of your configuration settings.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
run.config["key"] = updated_value
run.update()

Export system resource consumptions to a CSV file

The following snippet finds the system resource consumptions and saves them to a CSV.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
# stream="events" selects system metrics instead of logged metrics.
system_metrics = run.history(stream="events")
system_metrics.to_csv("sys_metrics.csv")

Get unsampled metric data

When you retrieve data from history, by default it’s sampled to 500 points. To get all the logged data points, use run.scan_history(). The following example downloads all the loss data points logged in history.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
# Passing keys returns only the rows that contain a "loss" value.
history = run.scan_history(keys=["loss"])
losses = [row["loss"] for row in history]

Get paginated data from history

If the backend fetches metrics slowly or API requests time out, lower the page size in scan_history so that individual requests don’t time out. The default page size is 500. Experiment with different sizes to see what works best:

import wandb

api = wandb.Api()

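run = api.run("<entity>/<project>/<run_id>")
# Replace these example keys with the metrics that you want to fetch.
cols = ["loss", "accuracy"]
rows = list(run.scan_history(keys=sorted(cols), page_size=100))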
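To make the page size easy to tune, the scan can be wrapped in a function. The following is a sketch under the assumption that you want one metric at a time. The name `collect_metric` is hypothetical, and `run` is assumed to come from `wandb.Api().run(...)`.

```python
def collect_metric(run, key, page_size=100):
    """Return every logged value of `key`, fetched `page_size` rows at a time.

    Hypothetical helper; `run` is assumed to come from wandb.Api().run(...).
    """
    values = []
    # A smaller page_size means more requests, each returning less data.
    for row in run.scan_history(keys=[key], page_size=page_size):
        values.append(row[key])
    return values
```

If a call such as `collect_metric(run, "loss", page_size=100)` still times out, try a smaller page size before looking for other causes.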

Export metrics from all runs in a project to a CSV file

This script downloads the runs in a project and produces a dataframe and a CSV of runs including their names, configs, and summary stats. Replace <entity> and <project> with your W&B entity and the name of your project, respectively.

import pandas as pd
import wandb

api = wandb.Api()
entity, project = "<entity>", "<project>"
runs = api.runs(entity + "/" + project)

summary_list, config_list, name_list = [], [], []
for run in runs:
    # .summary contains the output keys/values
    #  for metrics such as accuracy.
    #  We call ._json_dict to omit large files
    summary_list.append(run.summary._json_dict)

    # .config contains the hyperparameters.
    #  We remove special values that start with _.
    config_list.append({k: v for k, v in run.config.items() if not k.startswith("_")})

    # .name is the human-readable name of the run.
    name_list.append(run.name)

runs_df = pd.DataFrame(
    {"summary": summary_list, "config": config_list, "name": name_list}
)

runs_df.to_csv("project.csv")
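The script above stores each config and summary as a single dictionary-valued column. If you would rather have one CSV column per key, the following sketch flattens each run first, using only the standard library. The names `flatten_run` and `rows_to_csv` are hypothetical, and the `config.` and `summary.` column prefixes are an illustrative choice rather than a W&B convention.

```python
import csv
import io


def flatten_run(name, config, summary):
    """Merge a run's name, config, and summary into one flat dictionary."""
    row = {"name": name}
    # Skip special config values that start with "_", as in the script above.
    row.update(
        {f"config.{k}": v for k, v in config.items() if not k.startswith("_")}
    )
    row.update({f"summary.{k}": v for k, v in summary.items()})
    return row


def rows_to_csv(rows):
    """Serialize flat rows to CSV text, leaving missing values empty."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Inside the loop over runs, you would call `flatten_run(run.name, run.config, run.summary._json_dict)` and then write the result of `rows_to_csv(...)` to a file.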

Get the starting time for a run

This code snippet retrieves the time at which the run was created.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
start_time = run.created_at

Upload files to a finished run

The following code snippet uploads a selected file to a finished run.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
run.upload_file("file_name.extension")

Download a file from a run

This finds the file "model-best.h5" associated with the run at the given run path and saves it locally.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
run.file("model-best.h5").download()

Download all files from a run

This finds all files associated with a run and saves them locally.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
for file in run.files():
    file.download()

Get runs from a specific sweep

This snippet downloads all the runs associated with a particular sweep.

import wandb

api = wandb.Api()

sweep = api.sweep("<entity>/<project>/<sweep_id>")
sweep_runs = sweep.runs

Get the best run from a sweep

The following snippet gets the best run from a given sweep.

import wandb

api = wandb.Api()

sweep = api.sweep("<entity>/<project>/<sweep_id>")
best_run = sweep.best_run()

The best_run is the run with the best metric as defined by the metric parameter in the sweep config.

Download the best model file from a sweep

This snippet downloads the model file with the highest validation accuracy from a sweep with runs that saved model files to model.h5.

import wandb

api = wandb.Api()

sweep = api.sweep("<entity>/<project>/<sweep_id>")
runs = sorted(sweep.runs, key=lambda run: run.summary.get("val_acc", 0), reverse=True)
val_acc = runs[0].summary.get("val_acc", 0)
print(f"Best run {runs[0].name} with {val_acc}% val accuracy")

runs[0].file("model.h5").download(replace=True)
print("Best model saved to model.h5")

Delete all files with a given extension from a run

This snippet deletes files with a given extension from a run.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")

extension = ".png"
files = run.files()
for file in files:
    if file.name.endswith(extension):
        file.delete()

Download system metrics data

This snippet produces a dataframe with all the system resource consumption metrics for a run and then saves it to a CSV.

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")
system_metrics = run.history(stream="events")
system_metrics.to_csv("sys_metrics.csv")

Update summary metrics

To update summary metrics, pass a dictionary.

run.summary.update({"key": val})

Get the command that ran the run

Each run captures the command that launched it on the run overview page. To retrieve this command from the API, run the following:

import json

import wandb

api = wandb.Api()

run = api.run("<entity>/<project>/<run_id>")

# download() returns an open file object that json.load() can read.
meta = json.load(run.file("wandb-metadata.json").download())
program = ["python"] + [meta["program"]] + meta["args"]
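To turn that list into a single command line that is safe to paste into a shell, you can quote each part. The following sketch assumes that the metadata dictionary has the "program" and "args" keys used above. The function name `command_from_metadata` and the default interpreter name are illustrative assumptions.

```python
import shlex


def command_from_metadata(meta, interpreter="python"):
    """Rebuild a shell-quoted command line from run metadata.

    Hypothetical helper; `meta` is the dictionary loaded from
    wandb-metadata.json, and the interpreter name is an assumption.
    """
    parts = [interpreter, meta["program"], *meta.get("args", [])]
    # shlex.join quotes arguments that contain spaces or shell characters.
    return shlex.join(parts)
```

An argument that contains a space, such as a run name, comes back wrapped in quotes, so the command can be rerun as printed.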
