It’s All About Data
Inputs → Model → Outputs
Input Data: It all starts with data, but what kind?
Many types of data can be used in a model, including unstructured data such as images or natural language, as well as structured (tabular) data. The ability to work with unstructured data is a major distinction between machine learning and basic statistical analytics. However, machine learning usually still involves considerable data pre-processing, and the data are often transformed from one form to another. For example, a model used for identifying fraudulent online transactions does not use the data you might initially expect¹. Rather than looking at transaction logs or keyboard strokes, the model looks at the movements of the mouse cursor on the screen. It converts those movements into an image (see below) and uses image recognition to detect patterns in that image. The unique visual patterns provide a way to represent complex user behaviors that would be hard to capture through any other data type.
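To make the idea of transforming one data type into another concrete, here is a minimal sketch (not the actual fraud-detection pipeline cited above) of how a sequence of mouse-cursor coordinates could be rasterized onto a small pixel grid that an image-recognition model could then consume. The function name, grid size, and sample coordinates are illustrative assumptions.

```python
def rasterize_path(points, width=8, height=8):
    """Map (x, y) cursor samples onto a width x height pixel grid.

    Each visited cell is set to 1; every other cell stays 0. A real
    system would also encode details such as speed, direction, and
    click events, not just the positions."""
    grid = [[0] * width for _ in range(height)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    for x, y in points:
        # Normalize each coordinate into the grid's index range.
        col = int((x - x_min) / ((x_max - x_min) or 1) * (width - 1))
        row = int((y - y_min) / ((y_max - y_min) or 1) * (height - 1))
        grid[row][col] = 1
    return grid

# A short diagonal drag from the top-left toward the bottom-right.
image = rasterize_path([(0, 0), (50, 40), (100, 80), (200, 160)])
```

The resulting grid of 0s and 1s is, in effect, a tiny grayscale image: behavioral data re-expressed in a form an image model already knows how to read.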
Types of Data
Images
Image recognition is where deep learning really began to make a name for itself, taking on pattern-recognition tasks that previously could not be automated. Deep neural networks, for example, can pick up patterns in images that may not be perceptible to the human eye. An important point to note about image recognition is that the model architecture often depends on the size (i.e. number of pixels) of the image. Each pixel is an individual data point in the input going into the model; a different image size therefore means a different input size, which can impact results. When training a model for healthcare applications, you want to train it on images of the same resolution and size as those the model would see in real practice.
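The pixel-count point above can be sketched in a few lines: a model whose first layer expects a fixed number of input features can only accept one resolution, so images are typically resized first. This toy example uses nearest-neighbor resampling on a plain list-of-lists "image"; a real pipeline would use a library such as Pillow or OpenCV, and the sizes here are made up for illustration.

```python
def resize_nearest(image, new_h, new_w):
    """Downsample (or upsample) an image by picking the nearest source pixel."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def flatten(image):
    # Each pixel becomes one input feature going into the model.
    return [px for row in image for px in row]

big = [[r * 4 + c for c in range(4)] for r in range(4)]  # a 4x4 image
small = resize_nearest(big, 2, 2)                        # resized to 2x2
# flatten(big) has 16 features, flatten(small) only 4 --
# two different input sizes, hence two incompatible model architectures.
```

This is why training and deployment images should share the same resolution: resizing changes the very shape of the data the model sees.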
Natural language
Natural language refers to the unstructured, nuanced nature of human language, while natural language processing (NLP) describes the way machines make sense of it. NLP enables humans and machines to interact through everyday language rather than code; it is what powers applications such as Siri, Google Home, Amazon Alexa, and Google Translate. Human language is complex. It is not enough to know the meaning of individual words; the machine must infer the intent and meaning of words from the context in which they are used, which can vary widely.
How does a machine detect tone? Jargon? Sarcasm? Informal slang? NLP is a challenging field within machine learning in which much remains to be solved. Just as pixels are extracted from an image and transformed into data points a model can process, NLP models must distill language into a machine-readable form. This involves a number of techniques for evaluating syntax (i.e. grammar) and semantics (i.e. the meaning of words). Ultimately, regardless of the specific technique, each word, or group of words, ends up represented by a mathematical vector, and it is this vector that is passed through the NLP model. For clinical applications, it is important to understand that the data the model is trained on critically determines how accurately it can interpret information. If an NLP training dataset (called a “corpus”) does not contain sufficient representation of a specific topic or subspecialty, the model is likely to be less accurate at understanding text from that topic. Unfortunately, this is often the case with medication information and pharmacy-specific jargon, which is lacking in the large corpora behind many popular NLP models.
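A minimal bag-of-words sketch can show what "each word becomes a vector" means in practice. This is one of the simplest vectorization schemes (modern NLP models learn dense embeddings instead), and the example sentences are invented for illustration; note how a word outside the training vocabulary simply vanishes, which is exactly the corpus-coverage problem described above.

```python
def build_vocab(sentences):
    """Assign every distinct word in the training corpus a vector index."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(sentence, vocab):
    """Turn a sentence into a fixed-length vector of word counts."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:          # out-of-vocabulary words are dropped --
            vec[vocab[word]] += 1  # one reason corpus coverage matters
    return vec

corpus = ["take two tablets daily", "take one tablet twice daily"]
vocab = build_vocab(corpus)
v = vectorize("take two tablets daily", vocab)
```

It is vectors like `v`, not the raw text, that the model actually processes; a sentence full of pharmacy jargon absent from the corpus would map to a mostly empty vector.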
Tabular Data
Tabular data refer to data that can be represented in a table of rows and columns; the spreadsheet format most people know best is the Excel spreadsheet. Two types of variables should be considered here: categorical data, which take discrete values (e.g. colors, cities), and continuous data, which can take any numeric value within a range (e.g. age, blood pressure). Tabular data are abundant in every industry, and machine learning can be applied to them for many purposes, from optimizing business operations to forecasting sales.
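Before a categorical column can feed a model, its discrete values are commonly one-hot encoded into numeric indicator columns, while continuous columns can be used (nearly) as-is. Here is a minimal stdlib sketch of that step; the city names are made-up sample data, and real pipelines would typically use a library such as pandas or scikit-learn.

```python
def one_hot(values):
    """One-hot encode a categorical column.

    Returns one indicator row per input value, plus the ordered list of
    categories so each column position can be interpreted later."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

cities = ["boston", "austin", "boston"]   # a categorical column
encoded, categories = one_hot(cities)
# categories -> ["austin", "boston"]; each row of `encoded` marks one city
```

After encoding, every cell in the table is a number, which is the form tabular models expect.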
References:
Esman G. Splunk and Tensorflow for Security: Catching the Fraudster with Behavior Biometrics. Splunk-Blogs. https://www.splunk.com/en_us/blog/security/deep-learning-with-splunk-and-tensorflow-for-security-catching-the-fraudster-in-neural-networks-with-behavioral-biometrics.html.