PolicyAI API Integration Guide (v2)

Info

This guide covers the v2 API. For the legacy v1 API, see the PolicyAI v1 Integration Guide.

Overview

The PolicyAI API is simple to integrate into your app wherever you need to check for policy violations. PolicyAI examines the content you provide using our moderation LLM, determines whether it violates a policy, and returns details about any violation so you can take action on the content. Common actions include taking the content down or flagging the associated user account for review by AiMod or a moderator.

With the API you can:

  • Create and edit versioned policies
  • Create labeling endpoints that reference your policies
  • Label content by sending it to a labeling endpoint

We also provide a UI where you can:

  • Sign up for an account
  • Create, edit, manage, and test your policies
  • Access your account details

PolicyAI URL

UI: https://policyai-v2.musubilabs.ai
API: https://api.musubilabs.ai/policyai/docs

Sign up for an account

To create an account, navigate to the UI where you'll be prompted to sign up!

After you've created an account, you can view your account details on the Account page of the UI, including your organization memberships.

Organizations

When you create an account, your own personal organization is created for you in our system. Organizations are used by PolicyAI to determine who has access to what. Any policies or evaluation results that you create using your personal organization ID are accessible only to you!

API Authentication

When you use the API directly, you can authenticate using your API key like this:

GET https://api.musubilabs.ai/policyai/api/version
accept: application/json
Musubi-Api-Key: [your API key here]

You can grab your API key from the API Settings page of the UI.
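As a sketch, the same request can be built from Python using only the standard library. The endpoint URL and `Musubi-Api-Key` header are taken from the example above; the key value is a placeholder you'd replace with your own:

```python
from urllib.request import Request

API_KEY = "your-api-key-here"  # placeholder: copy yours from the API Settings page

# Build the version request with the Musubi-Api-Key header. Nothing is sent
# yet, so you can inspect exactly what will go over the wire.
req = Request(
    "https://api.musubilabs.ai/policyai/api/version",
    headers={"accept": "application/json", "Musubi-Api-Key": API_KEY},
    method="GET",
)

# To actually send it: urllib.request.urlopen(req)
```

Any HTTP client works the same way; the only requirement is that the `Musubi-Api-Key` header is present on every API call.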

Create your Policy

You can create and edit policies and policy versions on the Manage Policies page of the UI.

Each policy has one or more versions, and for each version you can configure the policy text, policy settings, and output fields.

Use the Policy Converter

The Policy Converter can generate a complete policy from a brief description of what you want to moderate, or convert an existing policy into the LLM-optimized format described below. It uses AI to structure the policy, sharpen the language, and organize it into the correct format — saving you the effort of writing or reformatting it manually.

Policy text

Policy text uses a structured markdown format with two main sections: Instructions and Policies.

  • # Instructions — Cross-cutting guidance that applies across all categories, such as subversion detection, language handling, or content-type-specific rules.
  • # Policies — The violation categories. Each category is an ## heading with a description of what constitutes a violation. Include a **Not a violation:** list under a category to clarify exceptions. Always include an ## Other category at the bottom for content that doesn't clearly violate any policy.

Example policy text for moderating chat messages:

# Instructions

- Users may use leetspeak, emoji, or other techniques to evade filters.
  Look through these obfuscation attempts when evaluating content.
- If the violating meaning requires multiple logical leaps to interpret,
  default to allowing the content.

# Policies

## Selling

- Actively selling any service or product
- Stating a price for a service or product with intent to sell it

**Not a violation:**

- Mentioning a product or brand casually without intent to sell

## Scamming

- Tricking someone into sending money, crypto, or signing up for a fraudulent service
- Directing someone off-platform using suspicious contact info

**Not a violation:**

- Sharing personal contact info without suspicious context

## Drugs

- Selling or seeking regulated or controlled substances

## Other

Any content that does not CLEARLY violate one of the above policies is Other.

Labels are automatically extracted from the ## headings in the Policies section. Each heading becomes a label — headings named "Other" or "Safe" are mapped to a CLEAR assessment, and all others are mapped to FLAGGED.
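To illustrate that mapping, here is a minimal Python sketch of the extraction rule — an illustration, not PolicyAI's actual implementation — run against an abbreviated version of the example policy above:

```python
POLICY_TEXT = """\
# Instructions

- Users may use leetspeak, emoji, or other techniques to evade filters.

# Policies

## Selling

- Actively selling any service or product

## Scamming

- Tricking someone into sending money

## Other

Any content that does not CLEARLY violate one of the above policies is Other.
"""

def extract_labels(policy_text: str) -> dict[str, str]:
    """Map each ## heading in the Policies section to an assessment."""
    labels = {}
    in_policies = False
    for line in policy_text.splitlines():
        if line.startswith("# "):
            # Only ## headings under "# Policies" become labels.
            in_policies = line.strip() == "# Policies"
        elif in_policies and line.startswith("## "):
            name = line[3:].strip()
            labels[name] = "CLEAR" if name in ("Other", "Safe") else "FLAGGED"
    return labels

print(extract_labels(POLICY_TEXT))
# → {'Selling': 'FLAGGED', 'Scamming': 'FLAGGED', 'Other': 'CLEAR'}
```

Note that headings in the Instructions section are ignored; only the Policies section contributes labels.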

Policy settings

PolicyAI supports customizing each policy in several ways.

Model

Choose the model that the policy will use to check for violations. We'll provide guidance on which model we expect to work best for a given application, and you can also experiment on your own datasets.

Content types

PolicyAI currently supports these kinds of content:

  • Text
  • Image
  • Text + Image combined

Video and audio support is coming soon. You can configure a policy to apply only to certain content types, which is useful when a policy is written and tested for a specific content type. For example, you might define one policy for user profile images and another for post text.

Testing your policy

Once you've put together a policy, test individual text or image content against it to make sure that it surfaces violations as you would expect.

To test your policy against a curated dataset, you can upload a test dataset on the Manage Datasets page of the UI, and run the policy against that dataset on the Test Policies page of the UI.

Setting up your labeling endpoint

Once your policy is ready, create a labeling endpoint so you can start labeling content against it.

Head to the Manage Labeling Endpoints page to create a new endpoint. Each endpoint references one or more policy labelers, identified by policy key and version number.

Tip

Create separate endpoints for each context where you'd want a different policy applied — for example, one for messages, one for profile bios, and one for event photos. You can also maintain separate endpoints per environment (dev, prod, staging).

Label your content

Once you have your endpoint ready, you can use it to label content via the API:

POST https://api.musubilabs.ai/policyai/v2/labels/byEndpoint/{organizationId}/{endpointName}/
Musubi-Api-Key: [your API key here]
Content-Type: application/json

{
    "content": [
        {
            "type": "TEXT",
            "text": "the content to check"
        }
    ]
}
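The same request can be sketched in Python with the standard library. The URL path and payload shape are taken from the example above; the organization ID and endpoint name are placeholder values you'd replace with your own:

```python
import json
from urllib.request import Request

API_KEY = "your-api-key-here"    # placeholder: from the API Settings page
ORGANIZATION_ID = "your-org-id"  # placeholder organization ID
ENDPOINT_NAME = "chat-messages"  # placeholder endpoint name

# Payload shape from the example above: a list of content items to label.
payload = {
    "content": [
        {"type": "TEXT", "text": "the content to check"},
    ]
}

url = (
    "https://api.musubilabs.ai/policyai/v2/labels/byEndpoint/"
    f"{ORGANIZATION_ID}/{ENDPOINT_NAME}/"
)
req = Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Musubi-Api-Key": API_KEY, "Content-Type": "application/json"},
    method="POST",
)

# To actually send it: urllib.request.urlopen(req)
```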

The response contains a list of labels (one per labeler configured in the endpoint). Each label includes:

  • The assessment, either CLEAR or FLAGGED. FLAGGED means that a violation was detected.
  • The label assigned by the policy (e.g. the violated category).
  • Additional output fields from the labeler, such as the reason for the result.
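For illustration, the snippet below filters a hypothetical response for flagged labels. The exact shape of the response body (the top-level `labels` key and the `reason` field name) is an assumption for this sketch — consult the API docs for the real schema:

```python
# Hypothetical response shaped like the description above; field names beyond
# "assessment" and "label" are assumptions for illustration.
response = {
    "labels": [
        {"assessment": "FLAGGED", "label": "Selling",
         "reason": "States a price with intent to sell"},
        {"assessment": "CLEAR", "label": "Other",
         "reason": "No violation detected"},
    ]
}

def flagged_labels(response: dict) -> list[dict]:
    """Return only the labels where a violation was detected."""
    return [l for l in response["labels"] if l["assessment"] == "FLAGGED"]

for label in flagged_labels(response):
    print(f"{label['label']}: {label['reason']}")
```

A typical integration would route each flagged label to the appropriate action, such as hiding the content or queueing the account for review.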

At this point, you can incorporate these results into your system as needed 🥳