Natural Language Processing (NLP)

What it is and why it matters

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding.

Evolution of natural language processing

While natural language processing isn’t a new science, the technology is rapidly advancing thanks to an increased interest in human-to-machine communications, along with the availability of big data, powerful computing and enhanced algorithms.

As a human, you may speak and write in English, Spanish or Chinese. But a computer’s native language – known as machine code or machine language – is largely incomprehensible to most people. At your device’s lowest levels, communication occurs not with words but through millions of zeros and ones that produce logical actions.

Indeed, programmers used punch cards to communicate with the first computers decades ago. This manual and arduous process was understood by a relatively small number of people. These days, you can use generative AI (GenAI) models such as ChatGPT to create code, brainstorm new ideas or summarize research topics.

This technology is made possible by large language models (LLMs) using NLP, along with other AI elements like machine learning and deep learning.

Why is NLP important?

Large volumes of textual data

Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.

Today’s machines can analyze more language-based data than humans, without fatigue and in a consistent way. Considering the staggering amount of unstructured data that’s generated every day, from medical records to social media posts, automation will be critical to fully analyze text and speech data efficiently.

Structuring a highly unstructured data source

Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents, and we mumble, stutter and borrow terms from other languages.

While supervised and unsupervised learning, and specifically deep learning, are now widely used for modeling human language, there’s also a need for syntactic and semantic understanding and domain expertise that are not necessarily present in these machine learning approaches. NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics.

Synthetic data and its many uses

Synthetically generated text is often used with NLP models. Want to learn more about what synthetic data is, why it’s so valuable, and how it’s being used today? Watch this explainer video with Brett Wujek, who leads product strategy for next-generation AI technologies at SAS, to hear why synthetic data is so important for the future.

Read the article about synthetic data, including how it works and how it relates to NLP

NLP in today’s world

Data quality via NLP and large language models

With text-related models like LLMs, more data isn't necessarily better – due to potential noise, duplication or ambiguity. When it comes to LLMs, the quality of data directly affects the generated results. Learn how semantic rules-based NLP techniques can help.

Read the first blog post in a series on LLMs and NLP

Natural language processing revamps regulatory responses

To provide rigorous responses to thousands of public comments, government agencies face a grueling, manual sorting process. With NLP, text analytics and generative AI, they can manage this task both effectively and accurately – while keeping experts at the center of the process.

Learn more in the blog post about uses of NLP, GenAI and text analytics in the public sector

Learn about chatbots and how they work with analytics and AI

A chatbot is a form of conversational AI designed to simplify human interaction with computers. Sophisticated chatbots learn and gather information to adapt to user preferences and provide personalized responses and recommendations – serving as digital AI assistants.

Read the explainer article to learn more about how chatbots work

Make every voice heard with NLP

Discover how machines can learn to understand human language and interpret its nuances; how AI, natural language processing and human expertise work together to help humans and machines communicate and find meaning in data; and how NLP is being used in multiple industries.

Get the e-book to learn more about natural language processing

Kia uses AI and advanced analytics to decipher meaning in customer feedback

Kia Motors America regularly collects feedback from vehicle owner questionnaires to uncover quality issues and improve products. But understanding and categorizing customer responses can be difficult. With natural language processing from SAS, Kia can make sense of the feedback. An NLP model automatically categorizes and extracts the complaint type in each response, so quality issues can be addressed in the design and manufacturing process for existing and future vehicles.

How does NLP work?

Breaking down the elemental pieces of language

Natural language processing includes many different techniques for interpreting human language, ranging from statistical and machine learning methods to rules-based and algorithmic approaches. We need a broad array of approaches because text- and voice-based data vary widely, as do the practical applications.

Basic NLP tasks include tokenization and parsing, lemmatization/stemming, part-of-speech tagging, language detection and identification of semantic relationships. If you ever diagrammed sentences in grade school, you’ve done these tasks manually before.

In general terms, NLP tasks break down language into shorter, elemental pieces, try to understand relationships between the pieces and explore how the pieces work together to create meaning.
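
As a rough illustration of breaking language into elemental pieces, the Python sketch below tokenizes a sentence and applies a deliberately crude suffix-stripping stemmer. The function names and the suffix list are assumptions made for this example; production stemmers and lemmatizers use much richer rules and dictionaries.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Illustrative suffix list only (an assumption for this sketch);
# real stemmers apply many more rules and exceptions.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The computers parsed the sentences.")
stems = [crude_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that a crude stemmer produces truncated forms such as "pars" and "sentenc" rather than dictionary words. Mapping words to dictionary forms is the job of lemmatization.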

These underlying tasks are often used in higher-level NLP capabilities, such as:

Content categorization provides a linguistic-based document summary, including search and indexing, content alerts and duplication detection.
Large language model (LLM)-based classification, particularly BERT-based classification, is used to capture the context and meaning of words in a text to improve accuracy compared to traditional models.
Corpus analysis is used to understand corpus and document structure through output statistics for tasks such as sampling effectively, preparing data as input for further models and strategizing modeling approaches.
Contextual extraction automatically pulls structured information from text-based sources.
Sentiment analysis identifies the mood or subjective opinions within a piece of text (as well as large amounts of text), including average sentiment and opinion mining.
Speech-to-text and text-to-speech conversion transforms voice commands into written text, and vice versa.
Document summarization automatically generates synopses of large bodies of text and detects the languages represented in multilingual corpora (collections of documents).
Machine translation automatically translates text or speech from one language to another.

In all these cases, the overarching goal is to take language input and use linguistics and algorithms to transform or enrich the text in such a way that it delivers greater value.
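
To make one of these capabilities concrete, here is a minimal sketch of lexicon-based sentiment scoring, including an average across documents. The word weights in LEXICON and the sample reviews are hypothetical, chosen only for illustration. Real sentiment analysis typically relies on much larger lexicons or trained models, and this sketch ignores negation and context.

```python
import re

# Hypothetical mini-lexicon for illustration; not taken from any real product.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "poor": -1, "terrible": -2}

def sentiment_score(text):
    """Sum the lexicon weights of the word tokens found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def average_sentiment(documents):
    """Mean sentiment score across documents (0.0 for an empty collection)."""
    scores = [sentiment_score(doc) for doc in documents]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up sample reviews.
reviews = [
    "Great car, I love the design.",
    "Terrible brakes and poor mileage.",
    "It is a car.",
]
scores = [sentiment_score(review) for review in reviews]
print(scores, average_sentiment(reviews))
```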

NLP methods and applications

How computers make sense of textual data

Natural language processing adds structure to unstructured data through text analytics, which counts, groups and categorizes words to extract structure and meaning from large volumes of content. This technology is used to explore textual content and generate new variables from raw text that may be visualized, filtered or used as inputs to predictive models or other statistical methods.
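
One simple way to picture how counting words turns raw text into new variables is a document-term matrix, where each row is a document and each column is a word count. The sketch below is an assumption-laden illustration (the function names and sample sentences are invented here), not a description of how any particular text analytics product works.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(documents):
    """Return a sorted vocabulary and one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

# Made-up sample comments.
docs = ["The engine stalls", "The engine is loud, the radio is fine"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each column of the matrix can then be filtered, visualized or passed to a predictive model like any other numeric variable.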

NLP and GenAI are used together for many applications, including:

Investigative discovery. Identify patterns and clues in emails or written reports to help detect and solve crimes.
Subject-matter expertise. Classify content into meaningful topics so you can take action and discover trends.
Content creation. Generate new content about specific topics and explain key ideas.

There are many common and practical applications of NLP in our everyday lives. Beyond working with copilots, here are a few more examples:

Have you ever used a chatbot to help resolve a customer service issue? Then you've used NLP tools for search, topic modeling, text generation, entity extraction and content categorization.
Have you ever looked at the emails in your spam folder and noticed similarities in the subject lines? You’re seeing Bayesian spam filtering, a statistical NLP technique that compares the words in spam to valid emails to identify junk mail.
Have you ever missed a phone call and read the automatic transcript of the voicemail in your email inbox or smartphone app? That’s speech-to-text conversion, an NLP capability.
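
The Bayesian spam filtering mentioned above can be sketched as a small naive Bayes classifier that compares word frequencies in spam and valid ("ham") messages. The class name, labels and training messages below are hypothetical, and the sketch assumes at least one training message per label. Real filters train on far more data and use additional signals.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesSpamFilter:
    """Minimal naive Bayes text classifier with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def _log_score(self, words, label, vocabulary):
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Log prior: share of training messages with this label.
        score = math.log(self.message_counts[label] / total_messages)
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = self.word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        words = tokenize(text)
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        return max(self.LABELS, key=lambda label: self._log_score(words, label, vocabulary))

# Made-up training messages.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Win a free prize now", "spam")
spam_filter.train("Free money, claim your prize", "spam")
spam_filter.train("Meeting agenda for Monday", "ham")
spam_filter.train("Your invoice for the project", "ham")

print(spam_filter.classify("Claim your free prize"))
print(spam_filter.classify("Agenda for the project meeting"))
```

Working in log probabilities avoids numeric underflow on long messages, and the add-one smoothing keeps a single unseen word from driving a label's probability to zero.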

A subfield of NLP called natural language understanding (NLU) has cognitive and AI applications. NLU goes beyond the structural understanding of language to interpret intent, resolve context and word ambiguity, and even generate well-formed human language on its own. NLU algorithms tackle the extremely complex problem of semantic interpretation – that is, understanding the intended meaning of spoken or written language, with all the subtleties, context and inferences that we humans are able to comprehend.

The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike. Imagine the power of an algorithm that can understand the meaning and nuance of human language in many contexts, from medicine to law to the classroom. As the volume of unstructured information continues to grow exponentially, we benefit from computers’ tireless ability to help us make sense of it all.

SAS® Visual Text Analytics

How can you find answers in large volumes of textual data? By combining machine learning with natural language processing and text analytics. Find out how your unstructured data can be analyzed to identify issues, evaluate sentiment, detect emerging trends and spot hidden opportunities.
