Regular Expressions in Kotlin

Learn how to improve your string manipulation with the power of regular expressions in Kotlin. You’ll love them!

Version

  • Kotlin 1.5, Other, IntelliJ IDEA

String manipulation and validation are common operations you’ll encounter when writing Kotlin code. For example, you may want to validate a phone number, email or password, among other things. While the String class has methods for modifying and validating strings, they’re not always enough.

For example, you can use endsWith to check whether a string ends with numbers. However, there are cases where using only simple string methods could make you write more code than needed.

Imagine you want to check whether a string is a phone number. The pattern for a phone number is three number strings separated by dashes. For example, 333-212-555 is a phone number, but 3333/2222 isn’t.

In this case, using string methods would work but could be very cumbersome. You’d need to split the string then check whether each result is a number string.
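
To see how cumbersome that gets, here is a minimal sketch of the split-and-check approach. The function name looksLikePhoneNumber is hypothetical, and the sketch assumes the simplified rule above: exactly three groups of digits separated by dashes.

```kotlin
// Hypothetical helper: validates the simplified phone format using only string methods.
fun looksLikePhoneNumber(input: String): Boolean {
    // Split on dashes; a valid number must produce exactly three parts.
    val parts = input.split("-")
    if (parts.size != 3) return false
    // Every part must be non-empty and contain only digits.
    return parts.all { part -> part.isNotEmpty() && part.all { it.isDigit() } }
}

fun main() {
    println(looksLikePhoneNumber("333-212-555")) // true
    println(looksLikePhoneNumber("3333/2222"))   // false
}
```

Each extra rule, such as a fixed number of digits per group, would add another condition to this function.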

Regular expressions, or regex, are tools that can help you solve string validation and string manipulation problems in a more compact way.

In this tutorial, you’ll learn:

  • How to create a regex pattern string and a Regex object.
  • Regex’s methods.
  • Character classes, groups, quantifiers and boundaries.
  • Predefined classes and groups.
  • Greedy quantifiers, reluctant quantifiers and possessive quantifiers.
  • Logical operator and escaping a regex pattern.
Note: This tutorial assumes you know the basics of Kotlin and how to use IntelliJ IDEA. If you’re new to IntelliJ IDEA, read IntelliJ IDEA’s installation guide.
If you’re new to Kotlin, check out this Kotlin Apprentice book. For more information about Spring web development, you can read Kotlin and Spring Boot: Getting Started.

Getting Started

Download the materials using the Download Materials button at the top or bottom of this tutorial. Open IntelliJ IDEA and import the starter project.

Before you start working, take a moment to learn the app’s backstory.

Understanding the Backstory

Superheroes are becoming more dominant and popular, which worries supervillains. So they decided to join forces to build Supervillains Club, a club dedicated to advancing the interests of villains and recruiting people to become supervillains. With more supervillains retiring soon, they need fresh blood.

Supervillains Club has a rudimentary web app, but it’s not sufficient. Right now, anyone can register in the web app, including superheroes masquerading as supervillains. The web app needs string validations to filter them and other features that require string methods.

You just accepted a job as a software engineer at Supervillains Club. Your task is to write functions using regex to validate, modify and process strings. Through your work, supervillains will rise to glory!

But first, you need to familiarize yourself with the web app, a Spring web application. Take a moment to examine the code.

The main files are:

  1. SupervillainsClubApplication.kt: The Spring app.
  2. SupervillainsClubController.kt: The controller which maps the URL paths to the methods serving the requests.
  3. SupervillainsService.kt: The thin layer between the controller and the model.
  4. Supervillain.kt: The model of a supervillain.
  5. RegexValidator.kt: The core library that processes strings with regex. This is where you’ll write your code.

There are also test files in the tests directory, SupervillainsClubControllerTest.kt and RegexValidatorTest.kt.

Building and Running the Web App

Build and run the web app. Then open http://localhost:8080. You’ll see the welcome screen and read the Supervillains Club mission:

Supervillains Club Landing Page

Click Register to enter the signup page. You’ll see a form asking for your name and description:

Supervillains Club Register Page

Choose batman for Name and A superhero from DC. for Description. Then click Register. You’ll see the following screen:

Supervillains Club List Page

Mayday, mayday: batman infiltrated Supervillains Club!

You need to find a solution for that as soon as possible!

It’s tempting to check whether a particular string is batman. But you got a requirement to prevent batman, batwoman, catman and catwoman from infiltrating Supervillains Club. It has to take case-insensitivity into account, so Batman is a no-go as well.

You may think of this solution:

fun validateName(name: String): Boolean {
  // lowercase() replaces toLowerCase(), which is deprecated as of Kotlin 1.5
  val lowerName = name.lowercase()
  if (lowerName == "batman" || lowerName == "batwoman" || lowerName == "catman" || lowerName == "catwoman") {
    return false
  }
  return true
}

That works, but it’s not scalable. What if you want to prevent strings with stretched vowels, like batmaaan and batmaaaaaaaaaaaaaaan? The if condition would become unmanageable.

It’s time to put regex to good use: preventing superheroes from entering Supervillains Club. :]

The Regex Object

A regex pattern is a string with a particular syntax and rules. It’s complex and confusing, even for veterans. In this tutorial, you’ll learn common rules that cover 80% of your daily needs.

A regex string can be like a normal string, for example, batman. But you can’t do anything with this string. You have to convert it into a Regex object.

To convert this regex string into a Regex object, you feed the string as the argument to the Regex class constructor:

val pattern = Regex("batman")

Now, you have a Regex object.

Regex has a couple of methods. The method that’s suitable for your purpose is containsMatchIn.

Find validateName in RegexValidator.kt. Replace the comment // TODO: Task 1 with:

val pattern = Regex("batman")
if (pattern.containsMatchIn(name)) {
  return false
}

containsMatchIn returns true if batman is in the name variable.
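
As a quick illustration outside the web app, the sketch below shows that containsMatchIn looks for the pattern anywhere in the input, while matches requires the whole input to match. The helper name containsBatman is hypothetical and only wraps the call for this example.

```kotlin
// Hypothetical wrapper around the same check used in validateName.
fun containsBatman(name: String): Boolean = Regex("batman").containsMatchIn(name)

fun main() {
    println(containsBatman("batman"))        // true
    println(containsBatman("i am batman!"))  // true: the pattern appears inside the input
    println(containsBatman("Batman"))        // false: the pattern is case-sensitive by default

    // matches() succeeds only when the entire input matches the pattern.
    println(Regex("batman").matches("i am batman!")) // false
}
```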

Build and run the app. Open http://localhost:8080/register. Fill batman for Name and A superhero from DC. for Description, then click Register:

Supervillains Club Validation on Register Page

Nice! batman can’t enter Supervillains Club. However, if you try with Batman, your validation will fail.

You’ll use RegexOption to solve that problem.

Using RegexOption

You can make regex case-insensitive by passing RegexOption.IGNORE_CASE as the second parameter to Regex’s constructor. Replace Regex("batman") with:

val pattern = Regex("batman", RegexOption.IGNORE_CASE)

Build and rerun the app. Choose Batman for Name then submit the form.

This time, the validation works perfectly. You finally prevented Batman from entering Supervillains Club for good.

Supervillains Club Batman with Capital B Validation

Note: RegexOption offers other options besides RegexOption.IGNORE_CASE.

You can use more than one RegexOption. Can you guess how? :]

Solution: You can pass a set of RegexOptions instead of a single value:

val pattern = Regex("batman", setOf(RegexOption.IGNORE_CASE, RegexOption.MULTILINE))

You can read more about RegexOption in the official documentation.

Flag expression

RegexOption‘s purpose is to alter the behavior of a regex. But you can achieve the same results without using RegexOption by writing the rule in the regex string, like this:

val pattern = Regex("(?i)batman(?-i)")

You get the same result as using RegexOption.IGNORE_CASE.

This strange syntax is a flag expression. Flag expressions have special meanings. (?i)batman(?-i) doesn’t mean the regex string matches the (?i)batman(?-i) string exactly.

The regex engine interprets the flag expressions differently than normal characters. (?i) tells the regex engine to treat the characters case-insensitively from now on. On the other hand, (?-i) tells the regex engine to treat the characters case-sensitively from this point on.

So (?i)b(?-i)atman means only b is case-insensitive. The rest of the characters are case-sensitive.
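
A small sketch of that last pattern in action. The function name matchesWithFlexibleFirstLetter is hypothetical and exists only for this illustration.

```kotlin
// Hypothetical helper: only the leading b is matched case-insensitively.
fun matchesWithFlexibleFirstLetter(name: String): Boolean =
    Regex("(?i)b(?-i)atman").matches(name)

fun main() {
    println(matchesWithFlexibleFirstLetter("batman")) // true
    println(matchesWithFlexibleFirstLetter("Batman")) // true: B is allowed by (?i)
    println(matchesWithFlexibleFirstLetter("BATMAN")) // false: ATMAN is checked case-sensitively
}
```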

But for this example, you’ll use only RegexOption.

Understanding Character Classes, Groups, Quantifiers and Boundaries

Another problem appears. A superhero called catman enters Supervillains Club. How do you forbid both catman and batman?

With a standard string method, you can use the if condition with a logical operator. But you’ll use regex.

You want to check whether the string is batman or catman. Notice that only one character is different. The remaining characters, atman, are the same.

Using Character Classes

You can use a character class to group b and c. Replace your pattern line with:

val pattern = Regex("[bc]atman", RegexOption.IGNORE_CASE)

The [ and ] create a character class. [bc] means either b or c. [aiueo] means vowels.

There are special characters inside square brackets. If you want to negate the characters, you can use ^. [^aiueo] means any characters other than vowels.

You can also use - to create a range of characters. [a-z] means a, b, c until z.
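
The three forms of character class described above can be tried in isolation. This is a minimal sketch; the variable names are made up for the example, and matches is used so that the whole input must fit the pattern.

```kotlin
// Illustrative patterns, one per kind of character class.
val batOrCat = Regex("[bc]atman")      // b or c, followed by atman
val nonVowel = Regex("[^aiueo]")       // any single character that is not a, i, u, e or o
val lowercaseLetter = Regex("[a-z]")   // any single character from a to z

fun main() {
    println(batOrCat.matches("catman"))    // true
    println(batOrCat.matches("ratman"))    // false
    println(nonVowel.matches("b"))         // true
    println(nonVowel.matches("a"))         // false
    println(lowercaseLetter.matches("q"))  // true
    println(lowercaseLetter.matches("Q"))  // false
}
```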

Build and run the app. Try to input catman. The validation works flawlessly.

Supervillains catman Validation in Registration Form

Next, you’ll take a look at groups and quantifiers.

Using Groups and Quantifiers

All is well until batwoman breaks into Supervillains Club. Now you need to prevent batman and batwoman as well. Notice, the difference is the wo string: You can’t use the character class to solve this problem.

bat[wo]man means batwman or batoman. It doesn’t match batwoman.

What you want is a group.

Add this new rule to the existing regex syntax. Replace your pattern line with:

val pattern = Regex("[bc]at(wo)?man", RegexOption.IGNORE_CASE)

Here, you use ( and ) to create a group. (wo) means a group of the wo string. Groups make characters a single unit.

To make this group optional, you apply ? after it. The regex string is bat(wo)?man.

? is a quantifier. A quantifier defines how many occurrences of a unit to match. There are a few varieties of quantifiers in regex:

  • ?: 0 or 1 occurrence.
  • +: 1 or more occurrences.
  • *: 0 or more occurrences.

You could use quantifiers to match occurrences of a unit:

  • ba+ matches ba and baaaaa fully, but doesn’t match b.
  • ba* matches b, ba and baaaaa fully.
  • ba? matches b and ba fully, but only matches baaaa partially.
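
The bullet points above can be checked with a short sketch. Here, "fully" corresponds to matches, and "partially" corresponds to find, which returns the first portion of the input that fits the pattern. The helper name firstMatch is hypothetical.

```kotlin
// Hypothetical helper: returns the first partial match, or null if there is none.
fun firstMatch(pattern: String, input: String): String? = Regex(pattern).find(input)?.value

fun main() {
    println(Regex("ba+").matches("baaaaa")) // true
    println(Regex("ba+").matches("b"))      // false: + needs at least one a
    println(Regex("ba*").matches("b"))      // true: * accepts zero a's
    println(Regex("ba?").matches("ba"))     // true
    println(Regex("ba?").matches("baaaa"))  // false: ? accepts at most one a
    println(firstMatch("ba?", "baaaa"))     // ba: only the first two characters match
}
```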

In your group, (wo)?, the syntax means the group on the left side of ? is either one occurrence or nothing.

That’s the purpose of the group. w and o in wo aren’t separable.

Build and run the app. Check to see that batwoman and catwoman can’t enter Supervillains Club.

Supervillains Club batwoman Validation

Supervillains Club catwoman Validation

What if you hadn’t used a group, so that the regex string was batwo?man?

Solution: The ? quantifier would apply only to the o character. So batwo?man matches batwman and batwoman, but not batman.

You don’t want this. You want either batman or batwoman, but not batwman.

Using Boundaries

You’re satisfied with your superb code: You protected Supervillains Club from superheroes. Then one day, a supervillain named I'm not Batman tries to register, and the validation stops the supervillain.

You get a complaint from your employer.

Now, you need to add a rule: The regex string should match batman, catman, batwoman and catwoman only if they appear at the beginning of the string.

Use a boundary to solve this problem. Add ^ in the front of the regex string. Then replace your pattern line with:

val pattern = Regex("^[bc]at(wo)?man", RegexOption.IGNORE_CASE)

The ^ character here doesn’t have the same meaning as the ^ character inside the brackets. ^bat means bat at the beginning of the string. [^bat] means any single character other than b, a and t.

Build and run the app. Now I'm not Batman can register successfully in Supervillains Club.

Supervillains Successful Registration for "I'm not Batman"

Note: If you want to match the regex string at the end of the string, use $. So bat$ means bat at the end of the string.
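
A minimal sketch of both boundaries, using the pattern from this section. The variable names are illustrative only.

```kotlin
// ^ anchors the match to the beginning of the input.
val startsWithHero = Regex("^[bc]at(wo)?man", RegexOption.IGNORE_CASE)
// $ anchors the match to the end of the input.
val endsWithBat = Regex("bat$")

fun main() {
    println(startsWithHero.containsMatchIn("Batman"))         // true
    println(startsWithHero.containsMatchIn("I'm not Batman")) // false: Batman isn't at the start
    println(endsWithBat.containsMatchIn("wombat"))            // true
    println(endsWithBat.containsMatchIn("batman"))            // false: bat isn't at the end
}
```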

Regex Helper Tools from IntelliJ IDEA

Sometimes when writing your regex pattern, you want to check if it works as soon as possible, even without running your app. For this purpose, use regex helper tools from IntelliJ IDEA.

Move your caret to the regex pattern and press Alt-Enter on Linux/Windows or Option-Enter on Mac:

Check RegExp Menu

You have two helper tools dealing with regex. One edits the regex fragment, and the other checks the regex pattern.

Choose Check RegExp:

Valid Result in Check RegExp Form

You have a form to validate an input string with your regex pattern. If the input string matches the regex pattern, you’ll see a green check mark.

If you put in an invalid input string:

Invalid Result in Check RegExp Form

You’ll see a red exclamation mark.

If you find that your regex pattern doesn’t work as expected, go back to your regex pattern. Press Alt-Enter or Option-Enter again:

Edit RegExp Fragment

Then choose Edit RegExp Fragment:

RegExp Fragment Editor

You’ll see a dedicated editor for your regex pattern where you can edit the pattern and get hints. For example, delete ) after the letter o:

RegExp Fragment Editor Validation

You’ll see a warning about the missing ).

For this regex pattern, an editor is overkill. But it can be handy while editing a complex regex.

Understanding Predefined Classes and Groups

You were working on the signup page when you got a distress call: Some superheroes have infiltrated Supervillains Club, and you need to root them out!

Open http://localhost:8080/impostors and you’ll see some names:

Supervillains Club Finding Impostors Form

Click Find Impostors and you’ll get… nothing. The clue is anyone with Captain is a superhero. Based on that information, it’s time to write a new regex.

You’ll use findAll, a different method from Regex. You don’t want to check whether a regex string matches a string. You want to take out strings that match the regex string inside a string.

In RegexValidator.kt, replace the content of filterNames with:

val pattern = Regex("""Captain""")
return pattern.findAll(names).map {
  it.value
}.toList()

The findAll method returns a sequence of MatchResult objects. To get the matched string, you use the value property of MatchResult.
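
A minimal sketch that makes the types explicit: findAll returns a lazy Sequence of MatchResult objects, and each MatchResult exposes the matched text through value. The function name findCaptains and the sample input are assumptions for this illustration, not the app’s actual data.

```kotlin
// Hypothetical stand-in for filterNames, with the intermediate type spelled out.
fun findCaptains(names: String): List<String> {
    val matches: Sequence<MatchResult> = Regex("""Captain""").findAll(names)
    return matches.map { it.value }.toList()
}

fun main() {
    // Sample input invented for the example.
    println(findCaptains("Captain America, Joker, Captain Marvel")) // [Captain, Captain]
}
```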

Build and run the app. Submit the form, and you’ll get this result:

Supervillains Club Captain Captain Result

Not good! You could write the regex string like this: Captain (America|Marvel). It works for this case, but it’s not scalable.

What if there’s another impostor named Captain Saving the World or Captain Love? Then you’d need to rewrite your regex string.

There’s a better way. You can use predefined classes and the + quantifier.

Replace your Regex with:

val pattern = Regex("""Captain\s\w+""")

\s and \w are predefined classes. \s means any whitespace character, like a space or a tab. \w means any word character: a letter, a digit or an underscore.
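
Here is a small standalone sketch of this pattern. The function name findImpostors and the sample names are assumptions made for the example. Note that \w+ stops at the first character that isn’t a word character, so only the first word after Captain is included in the match.

```kotlin
// Hypothetical stand-in for filterNames using the predefined classes.
fun findImpostors(names: String): List<String> =
    Regex("""Captain\s\w+""").findAll(names).map { it.value }.toList()

fun main() {
    // Sample input invented for the example.
    println(findImpostors("Captain America, Joker, Captain Marvel, Lex Luthor"))
    // [Captain America, Captain Marvel]

    // \w+ ends at the space after Saving.
    println(findImpostors("Captain Saving the World")) // [Captain Saving]
}
```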

Build and run the app. Click Find Impostors and you’ll get this result:

Supervillains Club Finding Impostors Result

Bingo! You successfully rooted them out!

Note: Sometimes you might prefer an explicit character class that’s easier to read. Because \w is the same as [a-zA-Z_0-9], you can replace your regex with this more readable one:
val pattern = Regex("""Captain\s[a-zA-Z_0-9]+""")

Captured Groups and Back-references

You captured all superheroes infiltrating Supervillains Club with the first name Captain. Now, they’re ready to convert to supervillains, but supervillains can’t use Captain as their name.

Your task now is to extract the last name from the superheroes. Later, Supervillains Club will give them a first name suitable for a supervillain.

To recap, you have to remove Captain from Captain Marvel, then give Marvel to your employer. Later, your employer will give them a different first name, like Dark Marvel. You only need to extract the last name.

Build and run the app. Open http://localhost:8080/extract then click Extract Names. Nothing happens:

Supervillains Club Extracting Names Form

To solve this problem, you’ll still use findAll. But this time, you’ll use a group in the regex string.

In RegexValidator.kt, replace the content of extractNames with:

val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

This code is almost the same as the previous code, but there are two differences:

  1. (\w+): The regex string now has a group.
  2. groupValues[1]: You use groupValues instead of value.

groupValues[1] refers to the (\w+) group in the regex string. Remember that (\w+) is the last name.

What is the number 1 in groupValues[1] exactly? It’s the index of the first group in the groupValues list.

You don’t use index 0 because it refers to the full match, such as Captain Marvel. But how big could groupValues be? It depends on the number of groups in the regex string.

Suppose you have three groups in the regex string:

val pattern = Regex("""((Cap)tain)\s(\w+)""")

If the input string is Captain Marvel:

  • Index 0 refers to Captain Marvel.
  • Index 1 refers to Captain.
  • Index 2 refers to Cap.
  • Index 3 refers to Marvel.

Groups are numbered by the position of their opening parenthesis, from left to right, so an outer group comes before the groups nested inside it. Index 0 refers to the full match. Index 1 refers to ((Cap)tain).

Then you go inside that group to get index 2, which refers to (Cap). Then you move to the right, and index 3 refers to (\w+).
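
The numbering can be confirmed with a short sketch. The helper name groupsOf is hypothetical.

```kotlin
// Hypothetical helper: returns the group values of the first match, or an empty list.
fun groupsOf(input: String): List<String> =
    Regex("""((Cap)tain)\s(\w+)""").find(input)?.groupValues ?: emptyList()

fun main() {
    // Index 0 is the full match, followed by the groups in order of their opening parenthesis.
    println(groupsOf("Captain Marvel")) // [Captain Marvel, Captain, Cap, Marvel]
}
```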

Build and run the app. Then click Extract Names. You’ll get this result:

Supervillains Club Extracting Names Result

You’ve extracted the last name perfectly. Good job!

You feel proud of your code. It helps supervillains prosper in this wicked world.

But your employer doesn’t have time to pick a custom first name for each superhero willing to become a supervillain. They tell you to use a generic first name, Super Evil, and be done with it. So Captain Marvel will become Super Evil Marvel.

Open http://localhost:8080/replace and click Replace Names. Nothing happens:

Supervillains Club Turning Superheroes into Supervillains

It’s time to convert these superheroes to supervillains!

To replace strings with regex, you use… guess what? replace. :]

Change the content of replaceNames in RegexValidator.kt with the code below:

val pattern = Regex("""Captain\s(\w+)""")
return pattern.replace(names, "Super Evil $1")

replace accepts two parameters. The first is the string against which you want to match your regex """Captain\s(\w+)""".

The second is the replacement string. It’s Super Evil $1.

The $1 in Super Evil $1 is a special sequence. $1 is the same as groupValues[1] in the previous example. This is a back-reference.

So the back-reference makes a reference to the captured group. The captured group is (\w+) in Captain\s(\w+).

It’s roughly like you wrote the following, except that replace also keeps any text between the matches:

val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
  "Super Evil ${it.groupValues[1]}"
}.joinToString()

But it’s much less code!
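
The sketch below runs the back-reference version on a made-up input, next to an equivalent form that passes a lambda to replace. The function names and the sample sentence are assumptions for the illustration.

```kotlin
val captainPattern = Regex("""Captain\s(\w+)""")

// Back-reference form: $1 is filled in with the first captured group.
fun replaceWithBackReference(names: String): String =
    captainPattern.replace(names, "Super Evil $1")

// Lambda form: the replacement is built from each MatchResult.
fun replaceWithLambda(names: String): String =
    captainPattern.replace(names) { match -> "Super Evil ${match.groupValues[1]}" }

fun main() {
    val input = "Captain America and Captain Marvel" // sample input invented for the example
    println(replaceWithBackReference(input)) // Super Evil America and Super Evil Marvel
    println(replaceWithLambda(input))        // Super Evil America and Super Evil Marvel
}
```

Both forms leave the text between the matches, here the word and, untouched.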

Build and run the app. Click Replace Names. You’ll see all superheroes who want to repent got a new first name:

Supervillains Club Turning Superheroes into Supervillains Result

Now with these new names, the superheroes have become supervillains officially!

Understanding Greedy Quantifiers, Possessive Quantifiers and Reluctant Quantifiers

Supervillains Club throws you another task. All supervillains have diet plans. The nutritionist in Supervillains Club has made a plan tailored for supervillains.

Open http://localhost:8080/diet and you’ll see a diet plan for supervillains in HTML format:

Supervillains Diet Plan Form

The data scientists ask you to extract the diet plan from the HTML file. In other words, you want to extract an array of the meals from the HTML string: 5kg Unicorn Meat, 2L Lava, 2kg Meteorite.

You need to match strings between the li tags. The strings could be anything. How do you match strings that can be anything?

You use . to represent any character in regex. Any character means any character in the universe, with one exception: line terminators.

By default, . doesn’t match line terminators. It matches them only when you configure the regex to do so, for example with RegexOption.DOT_MATCHES_ALL. But you don’t need to worry about this in this tutorial.

You know the ?, * and + quantifiers. These are called greedy quantifiers. You’ll find out why they’re greedy soon!

What happens if you join . and *? They match any characters or any strings!

Interestingly, you can add the ? or + quantifiers to .*. The quantifiers alter the behavior of .*. You’ll experiment with all of them.

Using Greedy Quantifiers

First, you’ll use the greedy quantifier, .*.

In RegexValidator.kt, replace the content of extractNamesFromHtml with:

val pattern = Regex("""<li>(.*)</li>""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

Here, you use the method you used previously, findAll. The logic is simple: You use a group to capture the string between the li tags. Then you use groupValues when extracting the string.

Build and run the app, then submit the form. The result isn’t something you expect:

Supervillains Diet Plan Greedy Quantifier Result

You got a one-item result, not a three-item result. The (.*) regex pattern swallowed every </li> string except the last one.

That’s why people call this quantifier greedy. It matches as much of the string as possible while still letting the full pattern match.

But there’s another quantifier that’s greedier than the greedy quantifier: the possessive quantifier.

Using Possessive Quantifiers

Now, replace the content of extractNamesFromHtml with:

val pattern = Regex("""<li>(.*+)</li>""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

Notice that the difference is you put + on the right of .*. This is a possessive quantifier.

Build and run the app. Then submit the form:

Supervillains Diet Plan Possessive Quantifier Result

The result is empty. A possessive quantifier grabs as much as it can and never gives anything back. The .*+ in <li>(.*+)</li> matches 5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li>, everything up to the end of the line. So by the time the regex engine moves to </li> in <li>(.*+)</li>, there’s nothing left to match, and because .*+ refuses to backtrack, the whole match fails.

What you want is a reluctant quantifier.

Using Reluctant Quantifiers

Replace the content of extractNamesFromHtml with:

val pattern = Regex("""<li>(.*?)</li>""")
val results = pattern.findAll(names)
return results.map {
  it.groupValues[1]
}.toList()

Notice, the difference is you put ? on the right of .*. This is a reluctant quantifier.

Build and run the app. Then submit the form:

Supervillains Diet Plan Reluctant Quantifier Result

This is the correct result. The (.*?) matches as few characters as possible before </li>. The (.*?) reluctantly moves forward.
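
To compare the three quantifiers side by side, here is a minimal sketch. The helper name extractItems and the sample HTML are assumptions; the sketch assumes the list sits on a single line, since . doesn’t match line terminators by default.

```kotlin
// Hypothetical helper: returns the first captured group of every match.
fun extractItems(pattern: String, html: String): List<String> =
    Regex(pattern).findAll(html).map { it.groupValues[1] }.toList()

fun main() {
    // Sample input invented for the example, kept on one line.
    val html = "<ul><li>5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li></ul>"

    // Greedy: takes everything, then backtracks to the last </li>.
    println(extractItems("""<li>(.*)</li>""", html))
    // [5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite]

    // Possessive: takes everything and never backtracks, so </li> can't match.
    println(extractItems("""<li>(.*+)</li>""", html)) // []

    // Reluctant: stops at the first </li> it can reach.
    println(extractItems("""<li>(.*?)</li>""", html))
    // [5kg Unicorn Meat, 2L Lava, 2kg Meteorite]
}
```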

Now, you successfully extracted the meals data using regex.

To get more familiar with these quantifiers, check out this comparison between them and their results:

Regex Quantifiers Comparison

Understanding the Logical Operator and Escaping Regex

Supervillains Club recruits a lot of young supervillains. They also monitor the chats between young supervillains to ensure they don’t defect. But Gen Z writes differently: they don’t respect English grammar.

This creates a problem when Supervillains Club wants to analyze Gen Z’s dialog when chatting. A Gen Z supervillain might write: “I just beat a hero :] looks like I’m good :)”.

You have to separate the dialog into sentences, but Gen Z supervillains don’t use end punctuation. Kids these days… :]

Fortunately, Supervillains Club’s NLP scientists have done their research. It looks like Gen Z uses :], :) and 🤣 as a . replacement.

Open http://localhost:8080/split and submit the form:

Supervillains Club Splitting Form

Nothing happens. It’s time to analyze Gen Z using regex!

To split the sentences using regex, you use… split!

In RegexValidator.kt, replace the content of splitSentences with:

val escapedString = Regex.escape(""":)""")
val pattern = Regex("""(:]|${escapedString})|🤣""")
return pattern.split(sentences).map {
  it.trim()
}

split looks inside the input string for separators that match the regex string, then cuts the input at each one. If the regex string is Y and the input string is sunny Y rainy Y cloudy, then the result, after trimming the spaces, is sunny, rainy and cloudy.

But you notice there’s another character, |. This is a special character in regex. It’s the alternation operator, which works as a logical OR.

If you want to split on more than one separator, join them using |. If the regex string is Y|B, then you’ll split the sentences using Y or B.

You’ll also see you escape :) using escape:

Regex.escape(""":)""")

The ) character is special in regex. As you learned previously, it’s the character you use to create a group.
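
The pieces above can be tried in a standalone sketch. The function mirrors the splitSentences body shown earlier; the sample sentence is invented for the example and deliberately doesn’t end with a separator. On the JVM, Regex.escape wraps the literal in \Q and \E so that every character inside is matched literally.

```kotlin
// Standalone copy of the splitting logic, for experimenting outside the web app.
fun splitSentences(sentences: String): List<String> {
    val escapedString = Regex.escape(":)")
    // Three alternatives: :] or the escaped :) or the emoji.
    val pattern = Regex("""(:]|$escapedString)|🤣""")
    return pattern.split(sentences).map { it.trim() }
}

fun main() {
    println(Regex.escape(":)")) // \Q:)\E
    println(splitSentences("I just beat a hero :] looks like I'm good :) time to celebrate"))
    // [I just beat a hero, looks like I'm good, time to celebrate]
}
```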

Build and run the app. Submit the form again. You’ll see this:

Supervillains Club Splitting Form Result

Your work impressed Supervillains Club. They offer to make you a supervillain.

Why not?

Your supervillain name is Regex Monster. When people have a problem, you tell them a popular regex joke: “Now you have two problems.” :]

Where to Go From Here

Download the final project using the Download Materials button at the top or bottom of the tutorial.

You learned the most common Regex methods, but there are some you didn’t try, like replaceFirst, splitToSequence and toPattern. You can consult the Regex API documentation to learn more.

You also need to be careful with the catastrophic backtracking problem. If you write regex wrong, the regex could consume high CPU and create an outage.

You used some regex patterns, but regex syntax is vast. For example, you haven’t used multiline patterns or named groups. Head to the Regex pattern documentation to learn more about the regex pattern.

Regex isn’t invincible. It fails in fuzzy operations like classifying the sentiment of a tweet. For this problem, you need Natural Language Processing or NLP.

Regex is complicated. You can debug the regex pattern in many regex playgrounds. One example is regex101. Choose the Java 8 flavor in the playground.

I hope you enjoyed this tutorial! Please join the forum discussion below if you have any questions or comments.
