BACK TO PRESS

Model Performance Evaluation in Qeexo AutoML

Dr. Geoffrey Newman, Dr. Leslie Schradin III, Tina Shyuan 10 September 2021

Introduction

Qeexo AutoML provides feedback on trained models through tables and charts. These visualizations can be useful in determining how models trained in Qeexo AutoML should perform in live classification and can help form suggestions for how to improve model performance in specific circumstances, such as diagnosing why a certain class label is hard to classify correctly, or how to deal with an edge-case. In this blog post, we will explore which visualizations are available and how to interpret them.

We will look at two problems and their machine learning solutions in this post. The first problem is very difficult to classify, and the results show that the machine learning solution solves the problem only partially. The second problem is easier: the machine learning solution performs better, though it is not perfect, and the results indicate where, and possibly how, improvements can be made. Through exposure to both situations, the reader will be able to make better use of Qeexo AutoML to improve their understanding of their data and the performance of their libraries.

Visualizations

The various visualizations provided by Qeexo AutoML will be explained here, in the same order as they are presented on the results page. This order was chosen for its practicality: it is the most common order (due to the progression of subtasks) used by machine learning engineers when working on a library.

UMAP

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality-reduction and visualization technique documented on arXiv. UMAP provides a down-projection of the feature space into two dimensions to allow for visual inspection, giving information on the separability of the instance labels. This is used as an indicator of how well a classifier could perform. Instances are colored by their event labels, so it can be observed whether distinct labels may be separated based on the available features.
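
To make this concrete, here is a minimal sketch, not taken from Qeexo AutoML, of producing a plot like Figure 1 with the open-source umap-learn package; the feature matrix and labels are synthetic stand-ins generated with scikit-learn.

```python
# Minimal sketch: 2-D UMAP projection of a feature matrix, colored by class label.
# Synthetic data stands in for the features extracted by the pipeline.
import matplotlib.pyplot as plt
import numpy as np
import umap  # pip install umap-learn
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

for label in np.unique(y):
    pts = embedding[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=f"class {label}")
plt.legend()
plt.title("UMAP projection of the feature space")
plt.show()
```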

Figure 1: UMAP visualization

PCA

PCA (Principal Component Analysis) transforms the data into a new orthogonal basis, with the directions ordered by how much of the variance in the data they explain. The two-dimensional PCA plots in Qeexo AutoML show the data projected onto the first two PCA directions. Similar to UMAP, these PCA plots give an indication of how separable the data may be when building machine learning models.
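
A comparable sketch for the PCA plot, again using scikit-learn on synthetic stand-in data rather than anything from the AutoML pipeline:

```python
# Minimal sketch: project the same kind of feature matrix onto the first two
# principal components; synthetic data stands in for the real features.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

pca = PCA(n_components=2)
projected = pca.fit_transform(StandardScaler().fit_transform(X))
print("explained variance ratios:", pca.explained_variance_ratio_)

for label in np.unique(y):
    pts = projected[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=f"class {label}")
plt.legend()
plt.title("First two principal components")
plt.show()
```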

Figure 2: PCA visualization

Note: Both UMAP and PCA are projections from a higher-dimensional feature space onto two dimensions. Separability of the classes in the two-dimensional plots should imply separability in the higher-dimensional space. However, lack of separability of the classes in the two-dimensional plots does not always imply lack of separability in the higher-dimensional space: the structure of the data in the higher-dimensional space may be such that machine learning models can still separate the data and find good solutions to the problem.

Confusion Matrix

The confusion matrix shows machine learning evaluation results in a grid, giving a breakdown by class of how the evaluation examples have been classified. The rows of this grid correspond to the true label, while the columns correspond to the predicted label. Therefore the diagonal elements represent correct classifications while the off-diagonal elements represent mis-classifications.

The following measures, computed from the confusion matrix, are useful for understanding the ROC curves and F1-score plots in a later section (a short numerical sketch follows the list):

  • True positive rate (sensitivity, recall): correctly-labeled positive instances divided by the total number of positive instances
  • True negative rate (specificity): correctly-labeled negative instances divided by the total number of negative instances
  • False positive rate: incorrectly-labeled negative instances (classified as positive) divided by the total number of negative instances
  • False negative rate: incorrectly-labeled positive instances (classified as negative) divided by the total number of positive instances
  • Precision: correctly-labeled positive instances divided by the total number of instances that have been classified as positive
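
As referenced above, here is a small numerical sketch of these rates computed from a two-class confusion matrix; the labels below are made up purely for illustration.

```python
# Two-class toy example: compute the rates defined above from a confusion matrix.
# `y_true` and `y_pred` are hypothetical labels (1 = positive class).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)        # true positive rate (sensitivity, recall)
tnr = tn / (tn + fp)        # true negative rate (specificity)
fpr = fp / (fp + tn)        # false positive rate
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)  # precision
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}  precision={precision:.2f}")
```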

It is important to keep in mind the trade-offs between these values when choosing how to balance a machine learning model. Depending on the task, it may be more important to have a high true positive rate, for example when it is costly to miss a defective product, as when performing quality control for automobile airbags. For any model, increasing the true positive rate will also increase the false positive rate (or at best keep it constant). This can be an issue, for example, with a pacemaker: if it constantly detects cardiac events that did not occur (false positives) and shocks the user, the result could be tissue damage or other problems. Many of the visualizations in Qeexo AutoML derive from the need to balance these measures.

Figure 3: Confusion matrix

Cross-Validation: By-fold Accuracies vs. Classes

Qeexo AutoML performs 8-fold cross-validation when a new model is trained. The data is split into eight separate folds; seven of the folds are used to train the model, and the held-out eighth fold is used to evaluate performance. This process is repeated seven more times, once with each of the other folds held out, and the results are combined to determine performance. Training and evaluating via cross-validation provides multiple draws from the training data set, so the results are less likely to be accidentally biased by an unlucky split in the data. The values from the multiple evaluation folds also enable one to build up statistics (e.g. standard deviation) and thereby estimate the error on the evaluation result (e.g. the mean value). All of the training data is utilized and treated equally; none of it is used only for training or only for evaluation.
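
For readers who want to reproduce the idea outside of Qeexo AutoML, the following is a minimal scikit-learn sketch of 8-fold cross-validation with by-fold accuracies; the classifier choice and synthetic data are illustrative assumptions, not the AutoML internals.

```python
# Generic 8-fold cross-validation sketch with by-fold accuracies; the classifier and
# synthetic data are placeholders, not the Qeexo AutoML pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
fold_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                           cv=cv, scoring="accuracy")

print("by-fold accuracies:", np.round(fold_acc, 3))
print(f"mean = {fold_acc.mean():.3f}, std = {fold_acc.std():.3f}")
```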

The results of cross-validation are displayed as a bar graph. Bar heights indicate average by-class accuracy over all folds. Individual fold performance is shown as a scatter plot overlaying the bar for the specific label. The by-fold numerical accuracies are also displayed in a chart on the results page underneath the graph. Note that a given OVERALL accuracy number in the graph and table is the prediction accuracy based on all evaluations in the specific fold; it is not the average of the by-class accuracies within the fold.

Figure 4: Cross-validation plot and table

MCC

The Matthews correlation coefficient (MCC) is a performance measure for two-class classifiers. It is the correlation between the predicted labels and the true labels, with both treated as binary variables. For Qeexo AutoML we have extended the measure to work with an arbitrary number of classes; this is done by computing an MCC score for each pair of classes based on their confusion sub-matrix.

MCC has an advantage over raw accuracy in situations where the class labels are not well balanced. A good example of this is when Qeexo AutoML is attempting to determine classifier performance in an anomaly detection task where only a few examples of the bad class are available among hundreds or thousands of instances. Using the MCC measure in place of accuracy can give a more well-rounded view of the fitness of the classifier.
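
One plausible way to compute such pairwise MCC scores is sketched below; the exact computation inside Qeexo AutoML may differ, and the labels are invented for illustration.

```python
# One way to compute pairwise MCC scores: for each pair of classes, keep only the
# examples whose true and predicted labels both fall in the pair (the 2x2 confusion
# sub-matrix) and compute the two-class MCC on that subset.
from itertools import combinations
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array(["A", "A", "B", "B", "B", "C", "C", "A", "C", "B"])
y_pred = np.array(["A", "B", "B", "B", "C", "C", "C", "A", "A", "B"])

for c1, c2 in combinations(np.unique(y_true), 2):
    mask = np.isin(y_true, [c1, c2]) & np.isin(y_pred, [c1, c2])
    score = matthews_corrcoef(y_true[mask], y_pred[mask])
    print(f"MCC({c1} vs {c2}) = {score:+.2f}")
```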

Here are some useful pieces of information about MCC:

  • Values are in [–1, +1]
  • Larger values indicate better classifier performance
  • +1 corresponds to a perfect classifier
  • 0 is equivalent to a random classifier
  • –1 corresponds to an exactly-wrong classifier

Figure 5: MCC table

ROC Curve

The ROC (Receiver Operating Characteristic) curve shows how different threshold values (applied to the continuous output of the classifier, e.g. the class probability) affect the false positive rate (FPR) and true positive rate (TPR) of the classifier in cross-validation. For each threshold, the point with the FPR on the x-axis and the TPR on the y-axis is plotted as a dot, and the dots are then connected by line segments. Qeexo AutoML generates these curves for each label and overlays them in different colors. This allows the user to see what performance trade-offs would have to be made at different thresholds.
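
The sketch below shows how per-class (one-vs-rest) ROC curves and AUC values can be produced from predicted class probabilities with scikit-learn; the model and data are synthetic stand-ins, not the AutoML implementation.

```python
# One-vs-rest ROC curves and AUC values from predicted class probabilities.
# The classifier and synthetic data are stand-ins for a trained AutoML model.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)

for k in np.unique(y):
    fpr, tpr, _ = roc_curve(y_te == k, proba[:, k])  # one-vs-rest for class k
    plt.plot(fpr, tpr, label=f"class {k} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```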

The ROC curve can be used to generate summary statistics for describing the fitness of the model. A common one, which we compute with Qeexo AutoML and provide with the ROC curve plot, is the AUC (Area Under Curve). This is the integral of the ROC curve, and it has these properties:

  • Values are in [0, 1]
  • Larger values indicate better classifier performance
  • +1 corresponds to a perfect classifier
  • 0.5 is equivalent to a random classifier
  • 0 corresponds to an exactly-wrong classifier

Figure 6: ROC visualization
Figure 7: ROC curve explanation. Source: Wikipedia (https://commons.wikimedia.org/wiki/File:Roc-draft-xkcd-style.svg)

F1-score

The F1-score is the harmonic mean of precision and recall, each of which have been defined above.

F_1 = 2 * precision * recall / (precision + recall)

Similar to the MCC, the F1-score is a performance measure, with larger values being better; it ranges from 0 to 1, with 1 being the best possible value. Similar to the ROC curve, we plot the F1-scores against the false positive rate as we sweep through the threshold values. In the graph, each dotted vertical line indicates where the F1-score is maximized for a class.
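
A small sketch of sweeping a probability threshold for a single class and locating the F1-maximizing threshold, analogous to the dotted vertical lines; again the data, model, and threshold grid are illustrative assumptions.

```python
# Sweep a probability threshold for the positive class and find where F1 peaks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, (proba >= t).astype(int)) for t in thresholds]
best = int(np.argmax(scores))
print(f"best threshold = {thresholds[best]:.2f}, F1 = {scores[best]:.3f}")
```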

Figure 8: F1-score visualization

Learning curve

The learning curve runs consecutive cross-validations with varying amounts of training data. The shape of the curve, and in particular the slope of the curve near the high limit of available training data, can be used to estimate whether additional training data could improve classifier performance. Given enough training data (drawn from the same underlying distribution), the learning curve will eventually asymptote to the best possible performance.
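
For reference, a generic scikit-learn sketch of a learning curve on synthetic data is shown below; Qeexo AutoML produces its learning curves per class and per fold, which this simple sketch does not attempt to reproduce.

```python
# Generic learning curve: cross-validated accuracy as a function of training-set size.
# Synthetic data and a generic classifier are used purely for illustration.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

plt.plot(sizes, val_scores.mean(axis=1), marker="o")
plt.xlabel("Number of training examples")
plt.ylabel("Cross-validated accuracy")
plt.title("Learning curve")
plt.show()
```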

Figure 9: Learning curve

Discussion

The example plots above come from a Qeexo AutoML pipeline run with real data collected by Qeexo during the development of an embedded solution. In this case, each collected data set had “testing data” linked to the specific “training data” label, so two plots are generated for each type of analysis. The left-hand plots (or the centered plot, for single-column figures) come from training and evaluating the model with cross-validation. Where right-hand plots are shown, they are generated from evaluations on the “test” data (which is used only for evaluation, not for training).

Task Description

The task consisted of recording microphone data from stepper motors which rotate for 3 seconds, reverse direction, rotate for 3 seconds, and repeat. Some of the motors were determined by a factory quality assurance team to be defective for one of several reasons; those motors were labeled BAD, while the non-defective motors were labeled GOOD. The quality assurance team used sound to perform the classifications, but due to the subtlety of the difference in sound between the GOOD and BAD motors, this task is difficult for humans to perform. As we see from the results above, the machine learning model, trained on microphone sensor data, also struggles with this classification task.

Feature Exploration

This particular example had very hard-to-classify data. The UMAP and PCA plots are consistent with this. While there are some BAD examples that are mostly-separated from the GOOD examples (UMAP: x less than about 2; PCA: distance from origin greater than about 0.1), most of the BAD examples are intermingled with the GOOD examples in these projections.

UMAP and PCA plots such as these could be indicative of improperly-collected or mislabeled data, a lack of information in the sensor stream(s) used for separating the classes, or a feature set that is inadequate for extracting the information necessary for separation.

Performance Analysis

The confusion matrix and cross-validation results show that the model has learned at least something about the problem, but the performance estimates are not great (70 ± 10% overall). The model can correctly classify some of the GOOD training motors (80 ± 10%), but is not doing better than a coin-flip on the BAD training motors (50 ± 30%), although the results vary a lot by training fold. The performance of the model on the held-out test data is a bit better, about 90% on GOOD motors and 60% on the BAD ones; this seems to be consistent with the variation seen in the cross-validation results shown in the by-fold accuracies vs. classes plot and table (Figure 4).

While the classes have some degree of imbalance in this problem (82 good motors, 56 bad motors, with an equal amount of data collected from each), this imbalance is not extreme enough for the overall accuracy to be a useless metric, and it has an advantage over the MCC in one respect: we have intuition about accuracy, but most of us do not have intuition about MCC. Looking at the MCC scores (0.27 for the training data, 0.51 for the testing data), they seem in line with prior analysis: the model has learned something (MCC = 0 is equivalent to coin-flip), the model is far from perfect (MCC = 1 is perfect), and the model performance on the test data is better than on the training data.

The ROC and F1-score curves are generally useful for understanding a model’s performance across a wide range of thresholds, and possibly for fine-tuning the threshold to achieve the desired trade-off between the classes. The model performance curves for the training evaluations are not encouraging in the case at hand: there does not seem to be a threshold that improves the overall model performance relative to the default threshold used to produce the cross-validation results. The test curve, on the other hand, shows a bit more promise. In the test ROC curve, one can see that by choosing a lower threshold, it is possible to get into the upper 70s in terms of accuracy for both classes simultaneously (True Positive Rate ~ 0.78, False Positive Rate ~ 0.22, hence True Negative Rate ~ 0.78). This is a little better than the overall test accuracy value (76%) computable from the confusion matrix. The ROC and F1-score curves are usually more useful when the model is close to the desired performance.

The learning curve gives an estimate for whether adding more training data will likely result in increased performance for the model under consideration. Unfortunately, in the learning curve plot for the problem at hand, while there is a slight increase in performance on the BAD data from 25k to about 45k training examples, over the same range the performance on the GOOD data is constant or decreasing slightly. The curves as a whole appear to be pretty flat, indicating that more training data of the same type is unlikely to significantly improve performance for the model in question.

Note that in general, learning curve results hold for the given model and hyperparameter set under consideration. It could be that the model is just not complex enough to learn the problem (meaning it has high bias error), and that a more-complex model could do better. There could then be a benefit from adding more data (that is, the learning curve for a more-complex model and hyperparameter set could have positive slope near the end of the graph). It is also possible that the performance simply will not increase beyond the upper bound seen in these results, regardless of additional data.

Overall, the results show that while the model has learned from the data, it has not learned enough to approach Qeexo’s original goal for this problem: by-class accuracies greater than 90% on this task. The gap between actual and desired performance is great enough to indicate that some basic change will be needed to improve the situation:

  • verify that the data sources have been labeled correctly
  • use different sensor streams that can capture better separability information
  • use different features for the same reason (for feature-based models)
  • perform more exploratory data analysis and data visualization to better understand the core problem

Alternative Task

The previous figures were generated with a problem and data that is difficult to classify. We also want to give the audience an example problem and data where a model performs better.

Task Description

This task is described as “air gesture classification”. It consists of using the accelerometer and gyroscope on an embedded device affixed to a stick that is used to perform motion gestures. These gestures include raising and lowering the stick, labeled DRUMS, and rocking the stick back and forth, labeled VIOLIN. The BASELINE label corresponds to not moving the stick, and is the appropriate output when the user is not performing either the DRUMS or VIOLIN gesture.

Task Results

The UMAP visualization for this task shows very good separation between DRUMS and the other classes. There is overlap between VIOLIN and BASELINE, although there appears to be a large area of BASELINE outside of this region of overlap.

Figure 10: UMAP – air gesture

The PCA plot gives similar information: DRUMS appears to be quite different from the other two classes in this projection, while VIOLIN and BASELINE appear similar to each other. The fact that DRUMS appears separable seems reasonable: playing drums requires broad sweeping motions that should create large-magnitude swings in the inertial sensors.

These plots indicate that DRUMS should likely be separable from the other two classes. VIOLIN and BASELINE may also be separable by a machine learning model, but there is no good indication of that in these two-dimensional projections.

Figure 11: PCA – air gesture

The confusion matrix shows OK classification. Most of the values are along the diagonal, with by-class accuracies of 74% for BASELINE, 97% for DRUMS, and 77% for VIOLIN. The two most commonly-confused classes are BASELINE and VIOLIN, with each being classified as the other about 20% of the time. This result is not surprising given the aforementioned visualizations, but given the nature of the problem, we should expect better: these two gestures are different enough to be obviously recognized as different by humans performing or observing them, and the differences should be obvious in the sensor streams chosen (accelerometer and gyroscope).

It is surprising that a significant fraction of the BASELINE data (~9%) has been mis-classified as DRUMS. These two classes were well-separated in the UMAP and PCA plots, and given the differences in these gestures we might expect these two classes to be the easiest to separate. In addition, the mis-classification is not symmetric: only about ~2% of the DRUMS examples are mis-classified as BASELINE.

Figure 12: Confusion matrix – air gesture

The by-fold cross-validation results give valuable information that is not captured by the overall confusion matrix: the model performance varies a lot by fold. Half of the folds show good-to-excellent performance, with 95% or greater overall accuracy and reasonable by-class accuracies. This indicates that for these folds training is probably proceeding well, and that the problem in general is likely solvable by the model in question. The other folds show serious problems: each has at least one class with by-class accuracy less than 50%. This indicates that there is some problem with the model, although it’s not clear what the problem might be. We’ll discuss what might be going wrong in a later section.

Figure 13: Cross-validation – air gesture

The MCC results support what we have seen so far in the confusion matrix and cross-validation results: the problem is solved to some extent, but there is likely an issue with the performance with the BASELINE class. The DRUMS-VIOLIN score is quite good (meaning the model separates these classes well), but the scores involving BASELINE are less promising.

Figure 14: MCC – air gesture

The ROC curves from this model are clearly better than those from the previous example problem, with larger AUC values (all > 0.9). These curves, as well as the F1-score curves, also show that the BASELINE class has lower performance than the other gesture classes.

Figure 15: ROC plot – air gesture
Figure 16: F1-score – air gesture

The learning curve for this model shows that as we increase the amount of data, we improve the BASELINE classification without decreasing performance for DRUMS or VIOLIN. Based on the confusion we see with the current model, combined with the fact that the curve does not yet appear to be approaching an asymptote, it is reasonable to expect that collecting more data and retraining the model would increase the BASELINE separability.

Figure 17: Learning curve – air gesture

Performance Analysis

Several pieces of information point to poor performance of the BASELINE class, and the by-fold cross-validation results show that there is some sort of problem for half of the training/validation fold pairs. The learning curve result indicates that more data could increase performance of the BASELINE class.

Ideas about possible root causes of the problem seen in the by-fold cross-validation results along with possible actions follow:

  • The training data may not be diverse enough. Specifically in this case, the BASELINE data may not be diverse enough. We at Qeexo have observed this difficulty often in problems in which there is a “baseline” or “background” kind of class. If the BASELINE data collections were performed with the stick held very still, or with just a few grips on the stick, large sections of the BASELINE data could be very similar to each other while also being quite different (from the standpoint of, say, statistical features like the mean and variance of the signal) from other large sections of the BASELINE data. For example, each section might consist basically of zero-vector data for the gyroscope and constant-vector data for the accelerometer, but with a different, distinct constant accelerometer vector for each section. Another way to look at it is that the BASELINE data might look like several discrete scenarios that do not connect to each other very well in feature space. This situation can make training difficult for machine learning models. Note, though, that the UMAP and PCA plots do not directly support this hypothesis: the BASELINE data appears mostly clumped together in those plots. On the other hand, there are BASELINE outliers in the PCA plot, and the UMAP plot has a relatively large area for the BASELINE data. With richer BASELINE data, it is possible that the machine learning models will learn to generalize better to other BASELINE scenarios not in the training data.
  • There may not be enough training data. By its nature, cross-validation trains on part of the training data and evaluates on the held-out validation set. If some particular scenario (e.g. the user holds the stick with a certain grip) only appears for a short time in the training data, when the data associated with that scenario is mostly or entirely contained within a validation fold, the training fold will contain little or no data associated with the scenario, and the model may not learn and perform well on that scenario. Collecting more training data of the same kind that has already been collected should help in this case. This is supported by the shape of the learning curves. Also notable: more training data is likely to help for the last two points described in this bullet list: if there is overfitting or lack of convergence.
  • There may be a problem with some of the data. If it has not been done already, the data should be visualized to make sure that there are no corrupted parts. Visual inspection of the data should also show human-readable separability for this problem, given the nature of the air gesture use case. If it does not, there is likely something wrong somewhere in the data collection process.
  • The machine learning model may sometimes be overfitting to the training data; that is, it may be learning not only the underlying problem that we want to solve, but also the noise that exists in the training data as well. If so, changing the model hyperparameters in a way to make the model less complex or to increase regularization may help reduce the model variability and improve performance.
  • Depending on the model type used, the model may have failed to converge for some of the folds. We at Qeexo have observed this most often for some neural network models. This may be considered a sort-of sub-case of the previous point. Changing the hyperparameters may improve performance in this case as well. For example, increasing the number of epochs while also reducing the learning rate may help.

We think the first two points above are most likely the best explanations of the problems seen in the evaluations, and contain the best ideas for fixing the problems: collect more training data, and in particular collect more BASELINE data in a way that gives more diversity for the data for that class.

A variation of this use case, after improvements were made, has been recorded as a demo.

Conclusion

Qeexo AutoML not only provides model-building functionality, but also presents visualizations of the data and evaluation metrics for the trained models. By understanding these various visualizations and metrics, one can gain insights into the problem the model is attempting to solve, as well as how the model is performing on the collected data and how it might be expected to perform when deployed to the device. (Deployment to device can be done with just one click on Qeexo AutoML.) Such insights can lead to useful actions that can significantly improve the performance of the machine learning model.

BACK TO PRESS

Arm’s Startup Day Shows its Support for Future Hardware Startups

All About Circuits 18 August 2021

Finding new tech often means looking at startups. To support this critical aspect of the tech industry, Arm launches its “Startup Day” event. What is it, and what were some key takeaways?

Read the full article here: https://www.allaboutcircuits.com/news/arms-startup-day-shows-its-support-for-future-hardware-startups/

BACK TO PRESS

Cross-Platform Swift at Scale

Alex Taffe 14 July 2021

The dream of every developer is to be able to write code once and deploy everywhere. This allows for smaller teams to deploy more reliable products to a wider variety of users. Unfortunately, due to the numerous different operating systems with varying underlying kernels and user interface APIs, this often turns into a nightmare for developers.

Why interpreted languages are insufficient

Interpreted languages like Python are, in theory, great for cross-platform code deployment. There are interpreters available for nearly any platform in existence. However, they fall short in a number of crucial ways. First is the relative difficulty of protecting intellectual property. Because the code is not compiled, it is viewable by anyone in plain text. This requires techniques such as obfuscation or Cython compilation, both of which make debugging and deployment very complicated. Even semi-compiled languages such as Java can suffer from this problem.

Another major issue with shipping interpreted code to end users is the installation size. Shipping the python interpreter, pip, a number of pip packages, and the code can easily exceed a gigabyte on disk for even a simple application. Electron suffers from similar issues, having to bundle an entire Chromium install. This often results in subpar apps that love to consume memory.

Finally, dependency management on end users’ machines tends to be very unreliable with Python. If the code a developer is shipping relies on a specific package version (for example, Tensorflow 2.4.2), an install script attempting to install that package may encounter a version conflict. The script at this point must make the decision to abort or overwrite the user’s existing package. Both of these result in an extremely poor user experience. Packaging tools such as pipenv attempt to solve this, but this assumes that the user has the tool installed and that the version the user has installed matches that of the install script. Another way to solve this is to create an entirely separate Python install on the user’s machine, but this takes up unnecessary space and is still not immune to reliability issues and version conflicts. Additionally, the requirement of an internet connection is a concern for customers whose internet access may be limited or restricted for security reasons.

Compiled language candidates

Compiled languages are an alternative to interpreted languages that are able to fulfill the requirements of intellectual property protection, small installation size, dependency management, and reliability. Below we consider a few:

C/C++
Pros

  • Mature compilers available for every OS
  • Decades of well documented and tested libraries that are multiplatform
  • Well understood by developers everywhere

Cons

  • No support for modern language features such as copy-on-write, optionals, built-in JSON parsing, etc.
  • Weakly typed with no null protections
  • Inconsistent standard library between Unix-based and Windows-based platforms without additional dependencies
  • No standard package manager

GO
Pros

  • Fast compiles
  • Relatively simple syntax
  • Statically typed

Cons

  • Garbage-collection-based memory management can lead to performance issues
  • No generics
  • Omits some features from more complex languages that help to reduce boilerplate code
  • Somewhat of a niche language that reduces overall usability due to the lack of available open source packages and community help

Rust
Pros

  • Modern language with strict typing, generics, patterns, etc.
  • “Automatic” memory management at compile time
  • Large community support

Cons

  • No stable ABI, with no plans to adopt one
  • Steep learning curve
  • Somewhat complex C bindings, with native Rust libraries unable to bridge the gap

Choosing Swift

Swift was originally developed to run on Apple platforms only. In 2016, Linux support was added. Since then, a large number of Swift packages have been designed to be compatible with both operating systems. In 2020, Windows support was added, completing support for the trio of major operating systems. For Qeexo, Swift ticked all the boxes. It is a modern language with features like type safety, optionals, generics, and a useful standard library, while also being cross platform, reliable, producing relatively small binaries, and protecting intellectual property. Additionally, with easy bindings into C code, it is easy to write code that interacts with well-tested, cross-platform libraries such as libusb.

Moving from Qeexo’s existing Python-based desktop client to one written purely in Swift has taken the installer size down from 108MB to 18MB, the install time from 57 seconds to 13 seconds, and the on-disk footprint from 830MB to 58MB. Additionally, the reliability of the installer and the first-launch experience has improved substantially, with no dependency issues caused by lack of internet access, version conflicts, or bad Python installs.

Write Swift once, run it everywhere

Thanks to a robust standard library, many of the operations required for our application such as network requests, JSON serialization, error handling, etc., simply work no matter which platform the code is running on. However, for complex tasks like USB device communication or WebSocket servers, more complex techniques were required. Libraries such as SwiftSerial or Vapor exist but are only supported on macOS and Linux due to the relatively new support of Swift on Windows. Rather than attempting to create a reliable WebSocket server or USB interface code from scratch on Windows in Swift, it was easier to hook into C libraries such as libwebsocket and libusb. Using a simple Swift package and module map (for example: https://github.com/NobodyNada/Clibwebsockets), Swift is easily able to pick up statically-linked libraries or dynamically-linked libraries (.dylib on Mac, .so on Linux, and .dll on Windows), and provide them with Swift syntax after a simple module import. These libraries are already cross-platform, so the work is already done. It only requires integrating a build script for the library into the overall compile process.

Using these techniques, very few locations required #if os() checks in our code. These included paths for log locations, paths for compiler flags in the package file, and USB path discovery. Everything else works the same across all operating systems.

Challenges with cross-platform Swift

As of the time this blog was published, Swift has been available on Windows for less than a year. This means that resources such as Stack Overflow questions and libraries designed for Swift on Windows are sparse. Everything has to be built from scratch, discovered in the documentation, integrated from a C library, or discovered by reading the Swift language source code. This will improve over time, but for the moment, cross-platform Swift programming can be a bit daunting for new users. For example, Alamofire, a very popular networking library for Swift, did not support Windows or Linux until very recently. Qeexo contributed patches to the open-source library to get it into a working state.

Another issue is that Swift is not yet considered ABI-stable on Windows or Linux. This means that all apps have to bundle the Swift runtime, rather than being able to take advantage of shared libraries. The runtime is not overly bloated, so this is not a major concern, but it is still a consideration.

Finally, even though Swift does run cross-platform, special considerations often have to be made. For example, the function sleep() is not part of the Swift standard library; it comes from POSIX. This means that code using the function will not compile on Windows. Instead, helper functions using Grand Central Dispatch and semaphores had to be developed to simulate a sleep function. In addition, UI libraries are not standardized: macOS uses Cocoa, Windows uses WinUI/WPF, and Linux uses Xlib. Swift is capable of interacting with all 3 APIs, but such code requires #if os() checks and will be different per OS. As a result, Qeexo chose to write small wrapper apps in each OS’s preferred language (Swift or C#), then spawn the cross-platform Swift core as a secondary process. An added benefit is that the helper app is able to watch the subprocess for crashes and alert the user in a seamless fashion.

A Swift Future

Thanks to an active community, a group dedicated to Swift on the server, and the backing of one of the largest technology companies on the planet (Apple), cross-platform Swift has a bright future. Eventually, SwiftUI might even allow for truly cross-platform UI development. In the meantime, Swift has allowed Qeexo to deploy a reliable and compact app that protects intellectual property and is truly cross-platform.

BACK TO PRESS

IoT News of the Week for July 9, 2021

Stacey on IOT 12 July 2021

Anyone can now build smarter sensors using Qeexo and STMicroelectronics sensors: This is a cool example of ML at the edge. Chip vendor ST Microelectronics is working with Qeexo, a startup that builds software to easily train and generate machine learning algorithms, to ensure that its sensors will easily work with Qeexo software.

Read the full article here: https://staceyoniot.com/iot-news-of-the-week-for-july-9-2021/

BACK TO PRESS

Qeexo adds AutoML to STMicro MLC sensors to speed tinyML, IIoT development

Fierce Electronics 08 July 2021

Machine learning developer Qeexo and semiconductor STMicroelectronics have teamed up to allow STMicro’s machine learning core sensors to leverage Qeexo’s AutoML automated machine learning platform that accelerates the development of tinyML models for edge devices.

Read the full article here: https://www.fierceelectronics.com/iot-wireless/qeexo-adds-automl-to-stmicro-mlc-sensors-to-speed-tinyml-iiot

BACK TO PRESS

Machine-learning capable motion sensors intended for IoT

New Electronics

Qeexo, developer of the Qeexo AutoML automated machine-learning (ML) platform, and STMicroelectronics have announced the availability of ST’s machine-learning core (MLC) sensors on Qeexo AutoML.

Read the full article here: https://www.newelectronics.co.uk/electronics-news/iot-machine-learning-capable-motion-sensors/238717/

BACK TO PRESS

Qeexo and STMicroelectronics Speed Development of Next-Gen IoT Applications with Machine-Learning Capable Motion Sensors

Qeexo / STMicroelectronics 07 July 2021

Mountain View, CA and Geneva, Switzerland, July 7, 2021

Qeexo, developer of the Qeexo AutoML automated machine-learning (ML) platform that accelerates the development of tinyML models for the Edge, and STMicroelectronics (NYSE: STM), a global semiconductor leader serving customers across the spectrum of electronics applications, today announced the availability of ST’s machine-learning core (MLC) sensors on Qeexo AutoML.

By themselves, ST’s MLC sensors substantially reduce overall system power consumption by running sensing-related algorithms, built from large sets of sensed data, that would otherwise run on the host processor. Using this sensor data, Qeexo AutoML can automatically generate highly optimized machine-learning solutions for Edge devices, with ultra-low latency, ultra-low power consumption, and an incredibly small memory footprint. These algorithmic solutions overcome die-size-imposed limits to computation power and memory size, with efficient machine-learning models for the sensors that extend system battery life.

“Delivering on the promise we made recently when we announced our collaboration with ST, Qeexo has added support for ST’s family of machine-learning core sensors on Qeexo AutoML,” said Sang Won Lee, CEO of Qeexo. “Our work with ST has now enabled application developers to quickly build and deploy machine-learning algorithms on ST’s MLC sensors without consuming MCU cycles and system resources, for an unlimited range of applications, including industrial and IoT use cases.” 

“Adapting Qeexo AutoML for ST’s machine-learning core sensors makes it easier for developers to quickly add embedded machine learning to their very-low-power applications,” said Simone Ferri, MEMS Sensors Division Director, STMicroelectronics. “Putting MLC in our sensors, including the LSM6DSOX or ISM330DHCX, significantly reduces system data transfer volumes, offloads network processing, and potentially cuts system power consumption by orders of magnitude while delivering enhanced event detection, wake-up logic, and real-time Edge computing.”

About Qeexo

Qeexo is the first company to automate end-to-end machine learning for embedded edge devices (Cortex M0-M4 class). Our one-click, fully-automated Qeexo AutoML platform allows customers to leverage sensor data to rapidly build machine learning solutions for highly constrained environments with applications in industrial, IoT, wearables, automotive, mobile, and more. Over 300 million devices worldwide are equipped with AI built on Qeexo AutoML. Delivering high performance, solutions built with Qeexo AutoML are optimized to have ultra-low latency, ultra-low power consumption, and an incredibly small memory footprint. For more information, go to www.qeexo.com.

About STMicroelectronics

At ST, we are 46,000 creators and makers of semiconductor technologies mastering the semiconductor supply chain with state-of-the-art manufacturing facilities. An independent device manufacturer, we work with more than 100,000 customers and thousands of partners to design and build products, solutions, and ecosystems that address their challenges and opportunities, and the need to support a more sustainable world. Our technologies enable smarter mobility, more efficient power and energy management, and the wide-scale deployment of the Internet of Things and 5G technology. Further information can be found at www.st.com.

For Press Information Contact:

Lisa Langsdorf
GoodEye PR for Qeexo
Tel: +1 347 645 0484
Email: [email protected]

Michael Markowitz
Director Technical Media Relations
STMicroelectronics
Tel: +1 781 591 0354
Email: [email protected]

BACK TO PRESS

Tree Model Quantization for Embedded Machine Learning Applications

Dr. Leslie J. Schradin, III 28 May 2021

This blog post is a companion to my talk at tinyML Summit 2021. The talk and this blog overlap in some content areas, but each also has unique content that complements the other. Please check out the video if you are interested.

Why Quantization?

For embedded applications, size matters. Compared to CPUs, embedded chips are much more constrained in various ways: in memory, computation ability, and power. Therefore, small models are not just desirable, they are essential. We at Qeexo have asked ourselves: how can we make small, high-performance models? We have dedicated resources to answering this question, and we have made advances in compressing many of our model types. For some of those model types, compression has been achieved through quantization.

For our purposes here, the following is what we mean when we “quantize” a machine learning model:

Change the data types used to encode the trained-model parameters from larger-byte data types to smaller-byte data types, while retaining the behavior of the trained model as much as possible.

By using smaller-byte data types in place of larger-byte data types, the quantized machine learning model will require fewer bytes for its encoding and will therefore be smaller than the original.
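
As a toy illustration of this definition (not Qeexo's implementation), the sketch below affine-quantizes an array of float32 parameters to uint8, keeping the scale and offset needed to approximately recover the original values.

```python
# Toy illustration: affine-quantize float32 parameters to uint8 and measure the
# size reduction and the reconstruction error.
import numpy as np

params = np.random.default_rng(0).normal(size=1000).astype(np.float32)  # stand-in values

lo, hi = params.min(), params.max()
scale = (hi - lo) / 255.0
quantized = np.round((params - lo) / scale).astype(np.uint8)   # 1 byte per parameter
dequantized = quantized.astype(np.float32) * scale + lo        # approximate recovery

print("bytes before:", params.nbytes, " bytes after:", quantized.nbytes)
print("max abs reconstruction error:", float(np.max(np.abs(params - dequantized))))
```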

Why Tree Models?

In our experience at Qeexo we have found that tree-based models often outperform all other model types when the models are constrained to be small in size (e.g. under a few hundred KB) or when there is relatively little available data (e.g. fewer than 100,000 training examples). Both of these conditions are often true for machine learning problems targeting embedded devices.

Memory constraints on embedded devices

As stated above, due to the memory constraints on the device the models must be small. For example, the Arduino Nano 33 BLE Sense has 1 MB CPU flash memory and 256 KB SRAM. These resources are shared among all programs running on the device, and the available resources for a machine learning model are going to be significantly smaller than the full amounts. The actual resources available for the machine learning model will vary by use-case, but a reasonable estimate as a starting point is that perhaps half of the resources can be used by the machine learning model.

Available data for embedded machine learning applications

As for the amount of available data, several factors tend to lead to relatively little available data for machine learning models targeting embedded devices:

  • To train a machine learning model that is to be used on a given embedded platform, the training data must be collected from that platform. Large datasets collected from embedded platforms are not generally available online.
  • The organization that wishes to produce the machine learning model must usually collect the data on its own. This is costly in time and effort.
  • Machine learning applications usually require specialized training data. This means that a data collection effort usually needs to be performed for each problem that is to be solved.

Because tree-based models tend to perform well, and are often the superior model type, under the constraints above, they are well-suited for embedded devices. For this reason, we are interested in using them for our own embedded applications and in offering them as part of our suite of models for use in Qeexo AutoML.

Tree Model Quantization

Encodings of trained tree-based models lead to parameters and data that fall into 3 categories:

  1. Leaf values: these are the values that carry the information about which class or regression value is the prediction for the instance in question during inference.
  2. Feature threshold values at the branch nodes: these values determine how an instance traverses the tree during inference, and eventually which leaf is reached in the end.
  3. The tree structure itself: this is the information encoding how the various nodes of the tree connect to each other.

The first two categories above, leaf values and feature thresholds, provide the possibility for quantization. At Qeexo, we have implemented quantization for these parameter types. While there has been research on and implementations for quantization of neural-network-based models, we aren’t aware of similar research and implementations for tree-based models; we have done our own research and created an implementation for quantization for tree-based models.

Quantization Gains and Costs

Our main gain for quantizing tree-based models: model compression

In our experience, the strongest constraint placed on tree models by embedded chips is the flash size. By quantizing the model, the number of bytes needed to encode the model is smaller. This allows tree-based models to fit on the device that would not have fit without quantization, or in the case of smaller tree-based models, quantizing them leads to more room on the embedded device for other functionality.

The main cost for quantizing tree-based models: loss of model fidelity

Reducing the size of the data types used to encode the trained model parameters leads to some loss of information for those parameters, and this leads to a difference in behavior between the original model and the quantized version. The difference in behavior is often small, but if it is not small enough it can sometimes be made smaller by giving up some compression gains (i.e. increase the size of the quantized data type; in the “Leaf Quantization Example” below, one could use the 2-byte uint16 instead of the 1-byte uint8 as the quantized data type. This would make the quantized model more faithful, but also bigger).

There is another factor that can lead to a reduction in compression gains when quantizing tree-based models: depending on the tree model in question and on the quantization type, it may be necessary to store some of the parameters of the quantization transformation along with the model on-device. This is because in some cases it is necessary to de-quantize one or more quantized parameters during inference to correctly produce predictions or prediction probabilities. Storing these transformation parameters costs bytes on-device, which eats into the compression gains. The number of transformation parameters required depends strongly on the quantization type. For example, when using a simple quantization scheme such as an affine transformation (a shift followed by a rescaling), the leaf quantization transformation parameters are 2 floating-point parameters if all leaf values are quantized together. On the other hand, for feature threshold quantization, each feature used in the model requires a stored transformation, which leads to 2 floating-point parameters per used feature. The former costs very little (a few bytes), and is applicable to any tree implementation with floating-point leaf values or integer leaf values when the values are large enough. The latter can cost many bytes depending on the number of features, and should only be considered for large models where the number of splits per feature across the tree-based model is large enough to absorb the cost of storing the feature threshold quantization parameters.

Leaf Quantization Example

To illustrate the gains and costs of tree model quantization, we’ll consider an example using leaf quantization. For our example, consider a “letter gesture” problem, a benchmark problem we use often at Qeexo. The setup:

  • Arduino Nano BLE Sense, held in the hand
  • Accelerometer and gyroscope sensors active
  • 4 classes of letter gestures, traced out in the air: A, O, S, V
  • 1 class of no gesture
  • Qeexo AutoML accelerometer and gyroscope feature stack used for the features
  • Data after featurization: approximately 300 examples for each letter gesture, and 1500 examples for the no-gesture case
  • Train/test split: approximately 2/3 of the data is used for training, 1/3 as the test set
  • Model: sklearn.ensemble.GradientBoostingClassifier with (mostly) default parameters (a minimal training sketch in this spirit follows this list)
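
The sketch referenced in the last bullet is below; it trains a GradientBoostingClassifier and evaluates accuracy and the confusion matrix, with synthetic data standing in for the featurized letter-gesture data set (the class balance is only roughly imitated).

```python
# Train a gradient-boosted tree classifier and report test accuracy and the confusion
# matrix. Synthetic data stands in for the featurized letter-gesture data; the class
# weights only roughly imitate the 4 gestures + no-gesture balance described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2700, n_features=40, n_informative=10,
                           n_classes=5, n_clusters_per_class=1,
                           weights=[0.11, 0.56, 0.11, 0.11, 0.11], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print("test accuracy:", accuracy_score(y_te, y_pred))
print(confusion_matrix(y_te, y_pred))
```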

The Un-Quantized Model

  • Overall accuracy on the held-out test set: 96.1% (927/965)
  • Confusion matrix:
true/pred      A  NOISE    O    S    V
A            133      0    2    0    2
NOISE          0    543    0    0    0
O              0      0   98    1    0
S              3      1   21   95    0
V              8      0    0    0   58
  • The model performs pretty well. The most difficult class for the model to classify correctly is ‘S’, which is at times classified as ‘O’
  • Bytes to encode the model and the breakdown into categories:
Value               Number  Encoding      Bytes
Leaves                3595  float32       14380
Feature Thresholds    3095  float32       12380
Tree Structure              various uint  12884
Total                                     39644

The breakdown in bytes will generally follow this pattern, with the leaves and feature thresholds making up about 1/3 of the total model size each (with leaves always a bit more; for each tree there is always one more leaf than the number of splits), with the tree structure making up the remaining 1/3.

To quantize the model, we need to choose what parameters to quantize and the quantization transformation. For this example, we are quantizing the leaves only, and let us quantize them via an affine transformation (a shift followed by a rescaling) down to 1-byte uint8 values. With this transformation, we will reduce the bytes required to encode the leaves by a factor of 4. For this tree implementation, we do need to retain one of the leaf quantization parameters on-device for use during inference, but this costs very little: a single 4-byte float32. This value is used to rescale the accumulated leaf values back into the un-quantized space before applying the softmax function to compute the by-class probabilities.
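
The following sketch illustrates this scheme on stand-in leaf values: quantize all leaves to uint8 with an affine transform, keep only the float32 scale, rescale the accumulated quantized sums at inference time, and apply the softmax (the common offset contribution, being identical for every class, cancels inside the softmax). The shapes and the simple accumulation below are illustrative assumptions, not the on-device tree implementation.

```python
# Stand-in leaf values: one selected leaf per tree per class at inference time.
# Quantize all leaves together with an affine map, keep only the float32 scale,
# and rescale the accumulated quantized sums before the softmax.
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_classes = 100, 5
leaves = rng.normal(scale=0.3, size=(n_trees, n_classes)).astype(np.float32)

lo, hi = leaves.min(), leaves.max()
scale = np.float32((hi - lo) / 255.0)        # the single parameter kept on-device
q_leaves = np.round((leaves - lo) / scale).astype(np.uint8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs_float = softmax(leaves.sum(axis=0))                            # un-quantized model
probs_quant = softmax(scale * q_leaves.sum(axis=0, dtype=np.int64))  # quantized model

print("max |difference in probabilities|:",
      float(np.max(np.abs(probs_float - probs_quant))))
```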

The Quantized Model

  • Overall accuracy on the held-out test set: 95.8% (924/965)
  • Confusion matrix:
true/pred      A  NOISE    O    S    V
A            133      0    2    0    2
NOISE          0    543    0    0    0
O              0      0   98    1    0
S              3      1   24   92    0
V              8      0    0    0   58
  • Bytes to encode the model and the breakdown into categories:
Value               Number  Encoding      Bytes
Leaves                3595  uint8          3595
Feature Thresholds    3095  float32       12380
Tree Structure              various uint  12884
Leaf quant params        1  float32           4
Total                                     28863

Consequences of Quantization

  • Checking the test example predictions shows that the only difference in the predictions is that the quantized model mis-classifies 3 previously-correct ‘S’ examples as ‘O’.
  • All 3 examples which were predicted differently are borderline cases for which the un-quantized model was already fairly confused.
  • The changes in probabilities between the two models is very small, with the maximum absolute difference in probabilities (across all test examples and classes) being about 0.013.
  • The multi log loss (computed via sklearn.metrics.log_loss) has increased a small amount: from 0.0996 for the un-quantized model to 0.0998 for the quantized model.
  • The number of bytes necessary to encode model parameters has been reduced by 27%.

The quantization procedure effectively introduces some small amount of noise into the model parameters. We expect that this will on average degrade performance, and this is seen here in this particular example: quantizing the model led to slightly lower performance on the held-out data.

So, at the cost of a slight loss in model fidelity we have reduced the overall model size by 27%. The trade-off observed in this example matches our experience at Qeexo with quantized models generally: a slight loss in fidelity with a significant reduction in model size. In the context of an embedded device, this amount of reduction is often a worthwhile trade-off.

Conclusion

In this post, we have discussed why compressed tree-based models are useful models to consider for embedded machine learning applications, and have focused on a particular compression technique: quantization. Quantization can compress models by significant amounts with a trade-off of slight loss in model fidelity, allowing more room on the device for other programs.

The quantized tree-based models are among the model types available for use in Qeexo AutoML, an end-to-end automated machine learning pipeline targeting embedded devices. To learn more and try out Qeexo AutoML for free, head to https://qeexo.com/ml-platform/.

BACK TO PRESS

Lose the MCU: ST and Qeexo Simplify Machine Learning with AutoML

All About Circuits 19 May 2021

Intelligent decision-making has moved to the edge with machine learning, though deployment could be complicated. Qeexo and STMicroelectronics are looking to change that.

Read the full article here: https://www.allaboutcircuits.com/news/lose-the-mcu-st-and-qeexo-simplify-machine-learning-with-automl/