DELIVERABLES

A machine learning program designed to predict and prevent errors for large-format printing presses.

Technologies Used

      • .NET
      • Python
      • Clustering algorithms (K-means and DBSCAN)


TECHNICAL DEEP-DIVE

The AndPlus Innovation Lab is where our passion projects take place. Here, we explore promising technologies, cultivate new skills, put novel theories to the test and more — all on our time (not yours).


Our Research

In order to reduce time wasted on failed jobs, AndPlus wanted to apply data science techniques and statistical learning knowledge to build a prediction model. When a user submits a job through the website, the model should predict whether that job is likely to fail. Once the model identifies a job as a likely failure, the system sends it down a different path to minimize the impact on other jobs if it does fail. This prediction model helps the printer run more efficiently. AndPlus provided detailed data for both failed and successful jobs.

Each job has one background (miscellaneous) layer, which contains the width and height of both the banner and the screen. The elements field contains data for the images and text fields that users added; the number of image and text layers varies among jobs.
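
As an example of a job's data structure, the sketch below shows roughly what one parsed record might contain. The field names here are illustrative assumptions, not the product's actual schema.

```python
# Hypothetical shape of one job record; field names are illustrative
# assumptions, not the actual schema.
job = {
    "misc": {                      # the single background (miscellaneous) layer
        "banner_width": 1200,      # banner dimensions
        "banner_height": 800,
        "screen_width": 1920,      # screen dimensions
        "screen_height": 1080,
    },
    "elements": [                  # image and text layers the user added;
        {"type": "image", "width": 400, "height": 300},  # count varies per job
        {"type": "text", "content": "50% OFF", "font_size": 72},
    ],
    "failed": False,               # label: did this job fail on the press?
}
```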

There were three stages to the research project:

  1. Data Wrangling
    • We used Python to parse all the JSON files into dataframes that could be manipulated and trained on in later steps (see the first sketch after this list).
    • We aggregated all the JSON files into 3 dataframes: MISC, Image, and Text.
  2. Feature Engineering
    • This is the process of creating new features and selecting the best of them to increase the predictive ability of the machine learning algorithms (see the second sketch after this list).
    • Before this step, we explored the dataset by examining the statistical distributions of both successful and failed jobs.
  3. Modeling
    • Model selection was the first step. Because the data is presented in a low-dimensional space, we believed a more complex model would perform better among all the classification models currently available.
    • Parameter search: for each model, we applied a grid search algorithm to find the optimal parameter combination.
    • Model evaluation after training was the next step (see the third sketch after this list).
    • ROC curve: a plot of the true positive rate against the false positive rate for each algorithm attempted.
    • A confusion matrix was used to track the performance of the classification models.
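
First, a minimal sketch of the wrangling step, assuming the job layout shown earlier and a `jobs/` directory of JSON files (both are assumptions for illustration):

```python
import json
from pathlib import Path

import pandas as pd

misc_rows, image_rows, text_rows = [], [], []

# Parse every job file and flatten it into rows for the three dataframes.
for path in Path("jobs").glob("*.json"):
    job = json.loads(path.read_text())
    job_id = path.stem
    misc_rows.append({"job_id": job_id, **job["misc"], "failed": job["failed"]})
    for element in job["elements"]:
        target = image_rows if element["type"] == "image" else text_rows
        target.append({"job_id": job_id, **element})

misc_df = pd.DataFrame(misc_rows)    # one row per job (MISC)
image_df = pd.DataFrame(image_rows)  # one row per image layer
text_df = pd.DataFrame(text_rows)    # one row per text layer
```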
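
Second, a sketch of the feature-engineering step. The write-up doesn't list the project's actual features, so the derived columns below are assumptions based on the job structure described above:

```python
# Illustrative derived features; these are assumptions, not the actual
# feature set used in the project.
features = misc_df.copy()
features["banner_area"] = features["banner_width"] * features["banner_height"]
features["aspect_ratio"] = features["banner_width"] / features["banner_height"]

# Per-job counts of image and text layers, joined from the other dataframes.
features["n_images"] = features["job_id"].map(
    image_df.groupby("job_id").size()).fillna(0)
features["n_texts"] = features["job_id"].map(
    text_df.groupby("job_id").size()).fillna(0)
```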
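
Third, a sketch of the grid search and evaluation flow using scikit-learn. The model choice and parameter grid here are placeholders, not the project's actual settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

X = features.drop(columns=["job_id", "failed"])
y = features["failed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Grid search over an illustrative parameter grid for one candidate model.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluation: ROC curve points (true vs. false positive rate) and a
# confusion matrix on the held-out test set.
probabilities = grid.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
print(confusion_matrix(y_test, grid.predict(X_test)))
```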

Deliverable

In this project, we parsed the JSON files of over 10,000 print jobs into a structured format. The labs team then ran these jobs through rigorous data pre-processing and handled all missing values. Once this process was complete, we had an initially high accuracy rate. Because the data was imbalanced, however, we wanted the model to focus on minimizing false positives and false negatives.

To further improve the models' performance in predicting failed jobs, we applied a resampling method. The table below compares four scenarios: raw data only; feature engineering without resampling; resampling without feature engineering; and preprocessed data with both methodologies. Each scenario is measured by accuracy, NPV (negative predictive value), and recall.

[Table: Machine learning results, AndPlus Innovation Lab]

From the table above, we can see that our feature engineering and resampling methods improved the models' performance greatly, so in the subsequent model training we used data with both feature engineering and resampling applied.
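
The write-up doesn't name the resampling method used; the sketch below shows SMOTE oversampling from the imbalanced-learn library as one common choice, applied to the training split only.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # one common choice; the actual
                                          # resampling method isn't specified

# Oversample the minority class (failed jobs) in the training set only, so
# the held-out test set keeps its natural class balance.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_resampled))
```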

[Chart: Machine learning model performance, AndPlus Innovation Lab]

Among all the single models, we picked the top three with the highest overall performance on accuracy, recall, and NPV: Random Forest, XGBoost, and LightGBM. The chart above further compares the results of these single models, with and without lasso regression, alongside three ensembling methods.

We can conclude that the accuracy rate is high across the board. Taking a deeper look at the NPV values, the Random Forest and logistic regression ensembling methods achieved fairly good performance. Compared to Ensemble-LR, Random Forest sacrifices a lot of accuracy, and its recall is 13% lower. Ensemble-LR returns the best result so far, at 95%.
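
"Ensemble-LR" reads like a stacking ensemble with a logistic-regression combiner; the sketch below shows that pattern in scikit-learn, under the assumption that the three top single models serve as base estimators (hyperparameters are placeholders).

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Stack the top three single models; a logistic regression learns how to
# combine their predictions (the assumed "Ensemble-LR" pattern).
ensemble_lr = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
        ("lgbm", LGBMClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
ensemble_lr.fit(X_resampled, y_resampled)
print(ensemble_lr.score(X_test, y_test))  # held-out accuracy
```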

EVERY CASE STUDY HAS A BACKSTORY


From a client:

AndPlus understands the communication between building-level devices and mobile devices, and this experience allowed them to concentrate more on the UI functions of the project. They have built a custom BACnet MS/TP communication stack for our products and are looking at branching out to other communication protocols to meet our market needs. AndPlus continues to drive our product management to excellence, often suggesting more meaningful approaches to complete a task and offering feedback on UI and human-interface design based on their knowledge from past projects.
