What is PFA for?

Hardening a data analysis

ML/AI deployment has not gotten any easier. A recent Gartner report showed that only 53% of projects make it from artificial intelligence (AI) prototypes to production.

There are usually significant differences between environments for building models and environments for deploying models.

What is needed is a solution that is designed for portability, stability, safety, and security.

Modelers can build models in a Python development environment, then deploy them to a pure Java environment with no code change.

PFA offers the following guarantees:

No malformed code
No insecure code
No malicious code
No required code change


Development: insight comes from exploratory tinkering.
Production: scalability comes from good design.

The Portable Format for Analytics (PFA) is a common language to help smooth the transition from development to production. PFA-enabled analysis tools can export machine learning or statistical models as JSON documents with a structure defined by the PFA specification. For instance, suppose a machine learning algorithm produces a classifier that has to be run in another application. If it produces that classifier in PFA format, any PFA-enabled application running on any system can execute it in a safe, controlled way.
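As a rough illustration of what such a document looks like, the sketch below builds a tiny PFA-style scoring engine as a Python dictionary and serializes it to JSON. The `input`, `output`, and `action` fields follow the layout used in the PFA tutorials; the engine itself (add 100 to every input) is only a placeholder rather than a trained model, and the variable names are illustrative.

```python
import json

# A minimal PFA-style document: read a double, add 100, emit a double.
# The arithmetic is a placeholder chosen for illustration, not a real model.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

# The exported artifact is plain JSON text that can be handed to another system.
pfa_text = json.dumps(pfa_document, indent=2)
print(pfa_text)

# Any JSON-aware tool can read it back without the tool that produced it.
restored = json.loads(pfa_text)
```

Because the document is ordinary JSON, the producing tool and the consuming application need nothing in common beyond the ability to read and write this text.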

Developer tools that speak PFA can deploy their inference engines on production environments that understand PFA. The only connection between the two worlds is the PFA document, a human-readable text file. In fact, this text file could have contributions from several statistical packages, or it could be modified by JSON-manipulating tools or by hand before it is delivered.

By contrast, scoring engines in custom formats present the system maintainers with three options: (a) try to install the data analyst’s tool across the production environment, including all of its dependencies, (b) port the algorithm and spend weeks chasing small (but compounding) numerical errors, or (c) dumb down the analytic. None of these is a good option.

Separation of concerns

PFA enables the safe deployment of models.

Tools such as Hadoop and Storm provide automated data pipelines, separating the data flow from the functions that are performed on data (mappers and reducers in Hadoop, spouts and bolts in Storm). Ordinarily, these functions are written in code that has access to the pipeline internals, the host operating system, the remote filesystem, the network, etc. However, all they should do is math.

PFA completes the abstraction by encapsulating these functions as PFA documents. From the point of view of the pipeline system, the documents are configuration files that may be loaded or replaced independently of the pipeline code.
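One way to picture "documents as configuration" is the sketch below. The `ScoringStage` class and its methods are hypothetical and not part of any PFA library; the sketch only checks for the top-level `input`, `output`, and `action` fields and does not execute the engine. The point is that the stage's own code stays the same while the document it holds is replaced.

```python
import json

# Top-level fields this sketch insists on before accepting a document.
REQUIRED_FIELDS = {"input", "output", "action"}


class ScoringStage:
    """Hypothetical pipeline stage that treats a PFA document as swappable config."""

    def __init__(self, pfa_text):
        self.document = None
        self.load(pfa_text)

    def load(self, pfa_text):
        """Parse and sanity-check a document, then swap it in.

        If the candidate is rejected, the previously loaded document stays active.
        """
        candidate = json.loads(pfa_text)
        if not isinstance(candidate, dict):
            raise ValueError("a PFA document must be a JSON object")
        missing = REQUIRED_FIELDS - set(candidate)
        if missing:
            raise ValueError("missing fields: " + ", ".join(sorted(missing)))
        self.document = candidate

    def signature(self):
        """Return the (input, output) types the current document declares."""
        return self.document["input"], self.document["output"]


# Two placeholder engine versions; only the document changes between them.
version_1 = json.dumps(
    {"input": "double", "output": "double", "action": [{"+": ["input", 100]}]}
)
version_2 = json.dumps(
    {"input": "double", "output": "double", "action": [{"*": ["input", 2]}]}
)

stage = ScoringStage(version_1)
stage.load(version_2)  # the model is refreshed; the pipeline code is untouched
```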

This separation of concerns allows the data analysis to evolve independently of the pipeline. Since scoring engines written in PFA are not capable of accessing or manipulating their environment, they cannot jeopardize the production system. Data analysts can focus on the mathematical correctness of their algorithms, and security reviews are only needed when the pipeline itself changes.

This decoupling is important because statistical and machine learning models usually change more quickly than the applications and frameworks that run them. Model details are often tweaked in response to discoveries about the data, and models frequently need to be refreshed with new training samples.

Safe deployment applies to the act of deployment as well. This is important when critical target environments must remain stable, when deployment may have issues (such as pushes to edge devices), or when there are concerns about the safety of the edge device itself. Decoupling also enables new forms of interaction with AI/ML, such as codeless programming and personal assistants.

Just as a PFA inference engine is not capable of accessing or manipulating its environment, the underlying code that understands PFA is untouched as new or updated inference engines are pushed out. Operations can focus on deploying targeted engines and not worry about the embedded application.

Flexibility and safety

As models push to the edge, new methods of model encapsulation are required. Edge devices require stricter execution parameters and safe code deployment. Traditionally, model deployment to the edge required custom code in languages that are not oriented toward machine learning, such as JavaScript, which restricted what could be achieved. Edge devices have since grown more powerful, with AI technology embedded in CPU architectures.

The Predictive Model Markup Language (PMML) was an attempt to bridge this gap by standardizing several of the most common kinds of scoring engines. Like PFA, PMML documents are intermediate text files (XML) produced by data analysis tools and consumed by an executable in the production environment. New functionality has been added to PMML over the past 17 years, but it is still based on tables of model parameters. Even a modest extension of a PMML inference engine requires a new version of PMML to be adopted, which can take years.

PFA serves this purpose with far more generality. Unlike PMML, PFA has control structures to direct program flow, a true type system for both model parameters and data, and statistical functions that are much more finely grained and can accept callbacks to modify their behavior. The author of a PFA document can construct new types of models from building blocks without waiting for the new model to be explicitly added to the specification.

PFA is more flexible than PMML, but safer than custom code. In the language of optimization, it is the most flexible way to describe a scoring engine subject to the constraint that it won’t break the data pipeline.

Overview of PFA capabilities

The following contribute to PFA’s flexibility:

It has control structures, such as conditionals, loops, and user-defined functions (like a typical programming language).
It is entirely expressed within JSON, and can therefore be easily generated and manipulated by other programs. This is important because PFA documents are usually generated from training data by a statistical package or a machine learning algorithm.
Its library of functions is finely grained: multi-step processes are defined by chaining multiple functions. A user with a new type of model in mind can mix and match these library functions as needed.
Many library functions accept callbacks to further modify their behavior.
Scoring engines can share data or update external variables, such as entries in a database.
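To make the first two points concrete, the sketch below generates a PFA-style document from a program. The `make_threshold_engine` helper and the threshold value are hypothetical stand-ins for whatever a training step would produce; the `if`/`then`/`else` form and the `{"string": ...}` literals reflect our reading of the PFA specification and should be checked against it before use.

```python
import json


def make_threshold_engine(threshold):
    """Build a PFA-style document that labels each input by comparing it to a threshold.

    `threshold` stands in for a parameter learned from training data.
    """
    return {
        "input": "double",
        "output": "string",
        "action": [
            {
                # Conditional control structure: compare the input to the threshold.
                "if": {">": ["input", threshold]},
                "then": {"string": "above"},
                "else": {"string": "not above"},
            }
        ],
    }


# A statistical package would compute the threshold; here it is simply assumed.
engine = make_threshold_engine(2.5)
engine_text = json.dumps(engine, indent=2)
print(engine_text)
```

Since the result is plain JSON, refreshing the model means regenerating the document with a new parameter, with no change to the code that will run it.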

The following contribute to PFA’s safety:

It has strict numerical compatibility: the same PFA document and the same input results in the same output, regardless of platform.
The specification only defines functions that transform data. All inputs and outputs are controlled by the host system.
It has a type system that can be statically checked. Specifically, PFA types are Avro types (Avro is a data serialization format used to move data in several popular pipeline frameworks). This system has a type-safe null, and PFA only performs type-safe casting; together these ensure that missing data never cause run-time errors.
The callbacks that generalize PFA’s statistical models are not first-class functions. This means that the set of functions that a PFA document might call can be predicted before it runs. A PFA host may choose to only allow certain functions.
The semantics of shared data guarantee that data are never corrupted by concurrent access and that scoring engines do not enter deadlock. The host can also determine statically, rather than at run-time, which shared variables may be modified by a scoring engine.
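The point about predicting which functions a document might call can be sketched as a simple tree walk over the JSON. This is a deliberately simplified assumption, not how a real PFA host works: it treats any single-key object whose value is a list as a function call, so it does not distinguish special forms from library functions and it misses calls written with a single non-list argument. The `called_names` helper and the allow-list are hypothetical.

```python
def called_names(expr):
    """Collect names used as {name: [args]} anywhere inside a JSON-like structure.

    Simplification: every single-key object with a list value counts as a call.
    """
    names = set()
    if isinstance(expr, dict):
        if len(expr) == 1:
            (name, args), = expr.items()
            if isinstance(args, list):
                names.add(name)
        # Recurse into all values so nested calls are found as well.
        for value in expr.values():
            names |= called_names(value)
    elif isinstance(expr, list):
        for item in expr:
            names |= called_names(item)
    return names


# Placeholder engine: square root of (input + 100).
document = {
    "input": "double",
    "output": "double",
    "action": [{"m.sqrt": [{"+": ["input", 100]}]}],
}

# Hypothetical host policy: only basic arithmetic is permitted.
allowed = {"+", "-", "*"}

found = called_names(document["action"])
rejected = found - allowed
print(sorted(found), sorted(rejected))
```

Because the check inspects only the document, a host could reject an engine before any data flows through it.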

To learn more, read the tutorials (which have interactive examples, so that you can see PFA in action) or the complete reference, which are linked in the sidebar or at the top of this page.
