Motivation

There is often a debate among data scientists about which tool is best for a given predictive modeling task. Some lean towards a gentler learning curve, some advocate for tools based on ease of development, and still others choose based on coverage of algorithms. Solutions and DevOps engineers, meanwhile, throw the “ease-of-deployment” angle into the debate.

The sheer variety of tools out there, and the freedom engineers have in choosing among them, hurts the operability of big data projects. The effect is even more pronounced for machine learning, where the tool landscape is more fragmented still. DataTorrent RTS attempts to solve some of these operability issues and reduce time to market, with PMML as a first step.

Introduction – What is PMML?

PMML stands for Predictive Model Markup Language and is an XML-based model interchange format for predictive machine learning models. PMML provides a way for analytic applications to describe and exchange predictive models produced by data mining and machine learning algorithms.

Most machine learning tools have already adopted the PMML standard and allow, at the very least, an export of the model into PMML format. The products that have adopted the standard are listed on the Data Mining Group (DMG) website, and we expect the PMML footprint to continue to grow over time.

Side Note: What are Predictive Machine Learning Models?

Machine learning algorithms that exploit information and patterns found in historical data in order to predict outcomes for new data are often called predictive models. Common examples of such algorithms are classification and regression. Classification, as the name implies, deals with data that falls into multiple classes and tries to figure out to which class a new datum belongs; classifying an email as either Spam or Not Spam is a classic example. Regression, on the other hand, deals with predicting a value from a continuous domain; an example would be predicting the price of a house given its locality and features.

Why PMML?

Data scientists are familiar with the mathematical model that an algorithm uses. For example, for a machine learning model like Naive Bayes classification, a data scientist knows that the content of the model is nothing but the prior and conditional probabilities learnt from the training data fed to the algorithm. However, each tool that implements this algorithm (or, equivalently, allows the creation of a Naive Bayes model) may represent the model in its own format and data structures. Oftentimes there is an implicit assumption that prediction (usually also known as scoring – see the note below) will happen on the same tool that was used for training. This assumption was more or less true until the big data revolution. However, even though the big data world has tools that allow both training and scoring on big data, the fact of the matter is that, most of the time, different tools are used for training and scoring. The reason is that these two phases are dominated by two very different enterprise roles – the data scientist and the data engineer. Both roles are responsible for different phases of a machine learning project and have different focus points. The data scientist is responsible for making the model as accurate as possible, while the data engineer is responsible for making sure the model is deployed in the production environment and runs smoothly without any hiccups.
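To make that concrete, here is a toy sketch (with made-up numbers, not tied to any particular tool) of what a trained Naive Bayes spam classifier boils down to once the probabilities have been learnt – the “model” is just these stored numbers, however a given tool chooses to serialize them:

import java.util.Map;

public class NaiveBayesToy {

    public static void main(String[] args) {
        // Prior probabilities of each class, learnt from historical training data.
        Map<String, Double> prior = Map.of("spam", 0.3, "ham", 0.7);

        // Conditional probability of the word "offer" appearing, given the class.
        Map<String, Double> pOfferGivenClass = Map.of("spam", 0.4, "ham", 0.05);

        // Scoring an email containing "offer": P(class) * P("offer" | class).
        // The normalizing constant is the same for both classes, so the class
        // with the larger product wins.
        for (String cls : prior.keySet()) {
            double score = prior.get(cls) * pOfferGivenClass.get(cls);
            System.out.printf("%s -> %.4f%n", cls, score);
        }
    }
}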

But the choice of tools and the split in roles are not the only reasons for this divide. Another important fact is that training data is almost always scarce. Most of the time this data is generated manually. Very few projects go for a crowd-sourced method of collecting training data, and even then the data needs to be verified manually or semi-manually, so most of the time it does not qualify as big data. For this reason, most training can happen on traditional, non-big-data tools. Hence, in spite of tools that allow training on big data (like Apache Mahout / Apache Spark) or on streaming pipelines (like Apache Apex / DataTorrent RTS), some even allowing streaming machine learning (like Apache SAMOA), most training remains a one-time activity, done on the traditional tools of the data scientist’s choice. The big data tools are used, but mostly for scoring. In reality, it is the data that needs to be ‘scored’, rather than the training data, that is the real big data.

This difference in the quantity of training and prediction data created different needs for processing systems for these phases. Naturally, most use cases out there want to do training on their traditional tools but would like to do prediction (scoring) on modern big data systems.

Another important consequence is that training, although a complicated process, runs on a small data set and hence is well suited to a batch job. Scoring, in contrast, is usually an ongoing activity: it uses the generated model and runs indefinitely over all the new data that enters the pipeline, which makes it a much better fit for a streaming pipeline than for a batch job.

Finally, data scientists are often reluctant to move to other systems because they are more comfortable with the tools they already know, and sometimes the training algorithm is simply too complicated to be translated into a different language. All of these factors have contributed to the creation and adoption of the PMML standard.

Side Note: Machine Learning Jargon!

Machine learning is the process of creating a software (often mathematical) model based on the historical data provided to it. The process of creating the model is referred to as training the model. Similarly, the data that goes into the creation of the model is referred to as the training data.

Once the model is created, it can be used to predict results for new data. Prediction can take multiple forms: for example, identifying whether an incoming transaction is fraudulent, or estimating the price of a given house. This process of predicting the result is often called prediction, evaluation, or scoring.

DataTorrent RTS – Machine Scoring Operator

DataTorrent RTS provides a scoring operator that can be used to score/predict incoming data using a PMML model. The only input the operator needs is the PMML file. By inspecting the file, it identifies the algorithm used to train the model and instantiates an appropriate scorer that understands the PMML representation of that algorithm.

The use of the PMML operator is illustrated below:
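As a rough illustration of what such a scoring step does internally, here is a minimal sketch built on the open-source JPMML-Evaluator library (this is an illustrative assumption, not DataTorrent’s actual operator code, and it assumes a jpmml-evaluator 1.6+ style API in which field names are plain strings):

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

// Illustrative generic PMML scorer; not the DataTorrent operator itself.
public class PmmlScorer {

    private final Evaluator evaluator;

    public PmmlScorer(File pmmlFile) throws Exception {
        // The builder parses the PMML and picks the right evaluator for the
        // model type it finds (NaiveBayesModel, TreeModel, RegressionModel, ...).
        this.evaluator = new LoadingModelEvaluatorBuilder()
                .load(pmmlFile)
                .build();
        this.evaluator.verify();
    }

    // Scores one incoming record (field name -> raw value).
    public Map<String, ?> score(Map<String, ?> record) {
        Map<String, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            String name = inputField.getName();
            // Convert the raw value to the dataType/optype declared in the
            // model's DataDictionary.
            arguments.put(name, inputField.prepare(record.get(name)));
        }
        // Returns the target field(s) plus any output fields defined in the model.
        return EvaluatorUtil.decodeAll(evaluator.evaluate(arguments));
    }
}

The value of packaging this plumbing as an operator is that it is written once and reused: a pipeline developer only points it at the PMML file, regardless of which tool produced the model.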

DataTorrent RTS also provides a fraud prevention application that uses a large set of pre-configured rules to identify and prevent fraudulent transactions.

Imagine now that the customer finds the set of rules configured in the application inadequate, so that actual fraudulent transactions are passing through undetected. One solution in this case could be to complement the existing pipeline with a machine scoring operator that predicts fraud based on a machine learning model. If the customer can export that machine learning model (which is created offline) into PMML format, the machine scoring operator can be used directly for this functionality without changing much of the existing pipeline. Of course, there might be a preceding operator that formats the input data the way the scoring operator expects, as sketched below. But this approach is highly advantageous over other approaches and takes care of the production nightmare of reimplementing the scoring logic.
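For instance, that formatting step could be as simple as mapping the pipeline’s transaction object onto the field names the exported model expects (the transaction fields and model field names below are purely hypothetical):

import java.util.HashMap;
import java.util.Map;

public class TransactionToModelInput {

    // Minimal stand-in for the pipeline's transaction record (hypothetical fields).
    public static class Transaction {
        public double amount;
        public String merchantCategory;
        public String country;
    }

    // Reshapes a transaction into the field-name -> value map the scoring
    // operator expects; keys must match the MiningSchema of the PMML model.
    public static Map<String, Object> toModelInput(Transaction tx) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("transaction amount", tx.amount);
        fields.put("merchant category", tx.merchantCategory);
        fields.put("country", tx.country);
        return fields;
    }
}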

As mentioned before, DataTorrent RTS tries to minimize the customer’s operational hurdles by making everything as seamless as possible. With the PMML operator deployed in production, it becomes straightforward for the user to update the model: it just requires replacing the PMML model file on HDFS. Of course, much more goes into developing or updating the model itself, but deploying the new model to production can be done seamlessly, without any downtime or operational hiccups.
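As a sketch of how small that deployment step can be, the snippet below overwrites the model file in place using the standard Hadoop FileSystem API (the paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ModelUpdater {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy the newly exported PMML file over the one the scoring
            // operator reads; delSrc = false, overwrite = true.
            fs.copyFromLocalFile(false, true,
                    new Path("/tmp/fraud-model-v2.pmml"),
                    new Path("/models/fraud-model.pmml"));
        }
    }
}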

An operable machine scoring ETL pipeline that processes incoming data in real time enables enterprises to leverage their machine learning investment and turn it into business outcomes. The data generated by the real-time pipeline built by data engineers becomes available to data scientists for their next cycle of machine learning batch jobs, thus completing the loop. An operable and productizable setup also enables enterprises to score multiple machine learning models in a pipeline as independent data services. Various other operational and productization aspects of this machine scoring framework – the ability to stage model scoring, compare models, visualize metrics, monitor the pipeline, and take real-time decisions based on multiple machine learning model scores – will be covered in later blog(s). These features are part of DataTorrent’s Apoxi™ framework, and some of these big data services are available in DataTorrent AppFactory today.

Side Note: What is the DataTorrent Fraud Prevention Solution?

DataTorrent’s Omni-Channel Payment Fraud Prevention Application does a lot of complex event processing (CEP): it passes each record through a series of business rules and triggers one or more actions. It is a typical example of CEP that leverages a rules engine to optimize rule application and trigger management.

A Sample PMML File

A sample PMML file can be seen at http://dmg.org/pmml/v4-3/NaiveBayes.html.
This example represents a Naive Bayes classification model, as can be seen from the NaiveBayesModel tag in the XML. The model tries to predict the amount of insurance claims for a car, given attributes such as the age of the owner, gender, number of claims, domicile, and age of the car. This information can be identified from the MiningSchema tag. The remaining tags, such as BayesInput and BayesOutput, represent the mathematical model for Naive Bayes itself. On the face of it, this looks more like regression than classification; however, the amount of claims field, which is the predicted (target) field, is discretized into a fixed set of values – 100, 500, and so on – which makes it a classification problem. See the model page for more information on the other tags in the XML format and on the method of scoring using the PMML file.

<PMML xmlns="http://www.dmg.org/PMML-4_3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.3">
    <Header copyright="Copyright (c) 2013, DMG.org"/>
    <DataDictionary numberOfFields="6">
      <DataField name="age of individual" optype="continuous" dataType="double"/>
      <DataField name="gender" optype="categorical" dataType="string">
        <Value value="female"/>
        <Value value="male"/>
      </DataField>
      <DataField name="no of claims" optype="categorical" dataType="string">
        <Value value="0"/>
        <Value value="1"/>
        <Value value="2"/>
        <Value value=">2"/>
      </DataField>
      <DataField name="domicile" optype="categorical" dataType="string">
        <Value value="suburban"/>
        <Value value="urban"/>
        <Value value="rural"/>
      </DataField>
      <DataField name="age of car" optype="continuous" dataType="double"/>
      <DataField name="amount of claims" optype="categorical" dataType="integer">
        <Value value="100"/>
        <Value value="500"/>
        <Value value="1000"/>
        <Value value="5000"/>
        <Value value="10000"/>
      </DataField>
    </DataDictionary>
    <NaiveBayesModel modelName="NaiveBayes Insurance" functionName="classification" threshold="0.001">
      <MiningSchema>
        <MiningField name="age of individual"/>
        <MiningField name="gender"/>
        <MiningField name="no of claims"/>
        <MiningField name="domicile"/>
        <MiningField name="age of car"/>
        <MiningField name="amount of claims" usageType="target"/>
      </MiningSchema>
      <BayesInputs>
        <BayesInput fieldName="age of individual">
          <TargetValueStats>
            <TargetValueStat value="  100">
              <GaussianDistribution mean="32.006" variance="0.352"/>
            </TargetValueStat>	
            <TargetValueStat value="  500">
              <GaussianDistribution mean="24.936" variance="0.516"/>
            </TargetValueStat>   
            <TargetValueStat value=" 1000">
              <GaussianDistribution mean="24.588" variance="0.635"/>
            </TargetValueStat>   
            <TargetValueStat value=" 5000">
              <GaussianDistribution mean="24.428" variance="0.379"/>
            </TargetValueStat>  
            <TargetValueStat value="10000">
              <GaussianDistribution mean="24.770" variance="0.314"/>
            </TargetValueStat>   
          </TargetValueStats>         
        </BayesInput>
        <BayesInput fieldName="gender">
          <PairCounts value="male">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4273"/>
              <TargetValueCount value="500" count="1321"/>
              <TargetValueCount value="1000" count="780"/>
              <TargetValueCount value="5000" count="405"/>
              <TargetValueCount value="10000" count="42"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="female">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4325"/>
              <TargetValueCount value="500" count="1212"/>
              <TargetValueCount value="1000" count="742"/>
              <TargetValueCount value="5000" count="292"/>
              <TargetValueCount value="10000" count="48"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="no of claims">
          <PairCounts value="0">
            <TargetValueCounts>
              <TargetValueCount value="100" count="4698"/>
              <TargetValueCount value="500" count="623"/>
              <TargetValueCount value="1000" count="1259"/>
              <TargetValueCount value="5000" count="550"/>
              <TargetValueCount value="10000" count="40"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="1">
            <TargetValueCounts>
              <TargetValueCount value="100" count="3526"/>
              <TargetValueCount value="500" count="1798"/>
              <TargetValueCount value="1000" count="227"/>
              <TargetValueCount value="5000" count="152"/>
              <TargetValueCount value="10000" count="40"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="225"/>
              <TargetValueCount value="500" count="10"/>
              <TargetValueCount value="1000" count="9"/>
              <TargetValueCount value="5000" count="0"/>
              <TargetValueCount value="10000" count="10"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value=">2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="112"/>
              <TargetValueCount value="500" count="5"/>
              <TargetValueCount value="1000" count="1"/>
              <TargetValueCount value="5000" count="1"/>
              <TargetValueCount value="10000" count="8"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="domicile">
          <PairCounts value="suburban">
            <TargetValueCounts>
              <TargetValueCount value="100" count="2536"/>
              <TargetValueCount value="500" count="165"/>
              <TargetValueCount value="1000" count="516"/>
              <TargetValueCount value="5000" count="290"/>
              <TargetValueCount value="10000" count="42"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="urban">
            <TargetValueCounts>
              <TargetValueCount value="100" count="1679"/>
              <TargetValueCount value="500" count="792"/>
              <TargetValueCount value="1000" count="511"/>
              <TargetValueCount value="5000" count="259"/>
              <TargetValueCount value="10000" count="30"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="rural">
            <TargetValueCounts>
              <TargetValueCount value="100" count="2512"/>
              <TargetValueCount value="500" count="1013"/>
              <TargetValueCount value="1000" count="442"/>
              <TargetValueCount value="5000" count="137"/>
              <TargetValueCount value="10000" count="21"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
        <BayesInput fieldName="age of car">
          <DerivedField optype="categorical" dataType="string">
            <Discretize field="age of car">
              <DiscretizeBin binValue="0">
                <Interval closure="closedOpen" leftMargin="0" rightMargin="1"/>
              </DiscretizeBin>
              <DiscretizeBin binValue="1">
                <Interval closure="closedOpen" leftMargin="1" rightMargin="5"/>
              </DiscretizeBin>
              <DiscretizeBin binValue="2">
                <Interval closure="closedOpen" leftMargin="5"/>
              </DiscretizeBin>
            </Discretize>
          </DerivedField>
          <PairCounts value="0">
            <TargetValueCounts>
              <TargetValueCount value="100" count="927"/>
              <TargetValueCount value="500" count="183"/>
              <TargetValueCount value="1000" count="221"/>
              <TargetValueCount value="5000" count="50"/>
              <TargetValueCount value="10000" count="10"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="1">
            <TargetValueCounts>
              <TargetValueCount value="100" count="830"/>
              <TargetValueCount value="500" count="182"/>
              <TargetValueCount value="1000" count="51"/>
              <TargetValueCount value="5000" count="26"/>
              <TargetValueCount value="10000" count="6"/>
            </TargetValueCounts>
          </PairCounts>
          <PairCounts value="2">
            <TargetValueCounts>
              <TargetValueCount value="100" count="6251"/>
              <TargetValueCount value="500" count="1901"/>
              <TargetValueCount value="1000" count="919"/>
              <TargetValueCount value="5000" count="623"/>
              <TargetValueCount value="10000" count="71"/>
            </TargetValueCounts>
          </PairCounts>
        </BayesInput>
      </BayesInputs>
      <BayesOutput fieldName="amount of claims">
        <TargetValueCounts>
          <TargetValueCount value="100" count="8723"/>
          <TargetValueCount value="500" count="2557"/>
          <TargetValueCount value="1000" count="1530"/>
          <TargetValueCount value="5000" count="709"/>
          <TargetValueCount value="10000" count="100"/>
        </TargetValueCounts>
      </BayesOutput>
    </NaiveBayesModel>
  </PMML>