Regular Expressions in Kotlin
Learn how to improve your string manipulation with the power of regular expressions in Kotlin. You’ll love them!
Version
- Kotlin 1.5, Other, IntelliJ IDEA

String manipulation and validation are common operations you’ll encounter when writing Kotlin code. For example, you may want to validate a phone number, email or password, among other things. While the String class has methods for modifying and validating strings, they’re not always enough.
For example, you can use endsWith to check whether a string ends with a given suffix. However, there are cases where using only simple string methods could make you write more code than needed.
Imagine you want to check whether a string is a phone number. The pattern for a phone number is three number strings separated by dashes. For example, 333-212-555 is a phone number, but 3333/2222 isn’t.
In this case, using string methods would work but could be very cumbersome. You’d need to split the string then check whether each result is a number string.
Regular expressions, or regex, are tools that can help you solve string validation and string manipulation problems in a more compact way.
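To make the contrast concrete, here is a minimal sketch of both approaches to the phone number check described above. The function names isPhoneNumberManual and isPhoneNumberRegex are hypothetical and aren't part of the starter project, and the pattern assumes a phone number is simply three runs of digits separated by dashes.

```kotlin
// Hypothetical helpers for illustration; neither is part of the starter project.

// Manual approach: split on dashes, then check that every part
// is a non-empty run of ASCII digits.
fun isPhoneNumberManual(input: String): Boolean {
    val parts = input.split("-")
    return parts.size == 3 && parts.all { part ->
        part.isNotEmpty() && part.all { it in '0'..'9' }
    }
}

// Regex approach: three digit runs separated by dashes.
// matches() requires the whole input to fit the pattern.
val phonePattern = Regex("""\d+-\d+-\d+""")

fun isPhoneNumberRegex(input: String): Boolean = phonePattern.matches(input)

fun main() {
    println(isPhoneNumberManual("333-212-555")) // true
    println(isPhoneNumberRegex("333-212-555"))  // true
    println(isPhoneNumberRegex("3333/2222"))    // false
}
```

Both functions give the same answers for these inputs, but the regex version states the shape of a valid string in a single line.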
In this tutorial, you’ll learn:
- How to create a regex pattern string and a Regex object.
- Regex’s methods.
- Character classes, groups, quantifiers and boundaries.
- Predefined classes and groups.
- Greedy quantifiers, reluctant quantifiers and possessive quantifiers.
- Logical operator and escaping a regex pattern.
If you’re new to Kotlin, check out this Kotlin Apprentice book. For more information about Spring web development, you can read Kotlin and Spring Boot: Getting Started.
Getting Started
Download the materials using the Download Materials button at the top or bottom of this tutorial. Open IntelliJ IDEA and import the starter project.
Before you start working, take a moment to learn the app’s backstory.
Understanding the Backstory
Superheroes are becoming more dominant and popular, which worries supervillains. So they decided to join forces to build Supervillains Club, a club dedicated to advancing the interests of villains and recruiting people to become supervillains. With more supervillains retiring soon, they need fresh blood.
Supervillains Club has a rudimentary web app, but it’s not sufficient. Right now, anyone can register in the web app, including superheroes masquerading as supervillains. The web app needs string validations to filter them and other features that require string methods.
You just accepted a job as a software engineer at Supervillains Club. Your task is to write functions using regex to validate, modify and process strings. Through your work, supervillains will rise to glory!
But first, you need to familiarize yourself with the web app, a Spring web application. Take a moment to examine the code.
The main files are:
- SupervillainsClubApplication.kt: The Spring app.
- SupervillainsClubController.kt: The controller which maps the URL paths to the methods serving the requests.
- SupervillainsService.kt: The thin layer between the controller and the model.
- Supervillain.kt: The model of a supervillain.
- RegexValidator.kt: The core library that processes strings with regex. This is where you’ll write your code.
There are also test files in the tests directory, SupervillainsClubControllerTest.kt and RegexValidatorTest.kt.
Building and Running the Web App
Build and run the web app. Then open http://localhost:8080. You’ll see the welcome screen and read the Supervillains Club mission:
Click Register to enter the signup page. You’ll see a form asking for your name and description:
Choose batman for Name and A superhero from DC. for Description. Then click Register. You’ll see the following screen:
Mayday, mayday: batman infiltrated Supervillains Club!
You need to find a solution for that as soon as possible!
It’s tempting to check whether a particular string is batman. But you got a requirement to prevent batman, batwoman, catman and catwoman from infiltrating Supervillains Club. The check has to be case-insensitive, so Batman is a no-go as well.
You may think of this solution:
fun validateName(name: String): Boolean {
    // lowercase() replaces toLowerCase(), which is deprecated as of Kotlin 1.5
    val lowerName = name.lowercase()
    if (lowerName == "batman" || lowerName == "batwoman" || lowerName == "catman" || lowerName == "catwoman") {
        return false
    }
    return true
}
That works, but it’s not scalable. What if you want to prevent strings with stretched vowels, like batmaaan and batmaaaaaaaaaaaaaaan? The if condition would become unmanageable.
It’s time to put regex to good use: preventing superheroes from entering Supervillains Club. :]
The Regex Object
A regex pattern is a string with its own syntax and rules. It’s complex and confusing, even for veterans. In this tutorial, you’ll learn common rules that cover 80% of your daily needs.
A regex string can look like a normal string, for example, batman. But you can’t do anything regex-related with a plain string. You have to convert it into a Regex object.
To convert this regex string into a Regex object, you pass the string as the argument to the Regex class constructor:
val pattern = Regex("batman")
Now, you have a Regex object.
Regex has several methods. The one that suits your purpose is containsMatchIn.
Find validateName in RegexValidator.kt. Replace the comment // TODO: Task 1 with:
val pattern = Regex("batman")
if (pattern.containsMatchIn(name)) {
return false
}
containsMatchIn returns true if batman appears anywhere in the name variable.
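As a rough illustration of what "anywhere" means, the sketch below compares containsMatchIn with matches, which only succeeds when the entire input fits the pattern. The isRejected helper is a hypothetical name used only for this example.

```kotlin
val batmanPattern = Regex("batman")

// Hypothetical helper: true when the name should be turned away.
fun isRejected(name: String): Boolean = batmanPattern.containsMatchIn(name)

fun main() {
    println(isRejected("batman"))                        // true
    println(isRejected("the batman returns"))            // true: found in the middle
    println(batmanPattern.matches("the batman returns")) // false: not the whole string
    println(isRejected("Batman"))                        // false: the pattern is case-sensitive
}
```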
Build and run the app. Open http://localhost:8080/register. Fill in batman for Name and A superhero from DC. for Description, then click Register:
Nice! batman can’t enter Supervillains Club. However, if you try with Batman, your validation will fail.
You’ll use RegexOption
to solve that problem.
Using RegexOption
You can make a regex case-insensitive by passing RegexOption.IGNORE_CASE as the second parameter to Regex’s constructor. Replace Regex("batman") with:
val pattern = Regex("batman", RegexOption.IGNORE_CASE)
Build and rerun the app. Choose Batman for Name then submit the form.
This time, the validation works perfectly. You finally prevented Batman from entering Supervillains Club for good.
Besides RegexOption.IGNORE_CASE, there are other regex options.
You can use more than one RegexOption
. Can you guess how? :]
[spoiler title=”Solution”]
You can pass a set of RegexOption
s instead of a single value:
val pattern = Regex("batman", setOf(RegexOption.IGNORE_CASE, RegexOption.MULTILINE))
[/spoiler]
You can read more about RegexOption
in the official documentation.
Flag expression
RegexOption
‘s purpose is to alter the behavior of a regex. But you can achieve the same results without using RegexOption
by writing the rule in the regex string, like this:
val pattern = Regex("(?i)batman(?-i)")
You get the same result as using RegexOption.IGNORE_CASE.
This strange syntax is a flag expression. Flag expressions have special meanings: (?i)batman(?-i) doesn’t match the literal string (?i)batman(?-i).
The regex engine interprets flag expressions differently than normal characters. (?i) tells the regex engine to treat the characters case-insensitively from that point on. On the other hand, (?-i) tells the regex engine to go back to treating the characters case-sensitively from that point on.
So (?i)b(?-i)atman means only b is case-insensitive. The rest of the characters are case-sensitive.
But for this example, you’ll use only RegexOption
.
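If you want to see the flag expressions in action, here is a small sketch. The variable names are made up for this example; the patterns are the ones discussed above.

```kotlin
// Case-insensitive for the whole pattern, like RegexOption.IGNORE_CASE.
val wholePatternInsensitive = Regex("(?i)batman")

// Only the first letter is case-insensitive; "atman" must be lowercase.
val firstLetterInsensitive = Regex("(?i)b(?-i)atman")

fun main() {
    println(wholePatternInsensitive.containsMatchIn("BATMAN")) // true
    println(firstLetterInsensitive.containsMatchIn("Batman"))  // true
    println(firstLetterInsensitive.containsMatchIn("batman"))  // true
    println(firstLetterInsensitive.containsMatchIn("BATMAN"))  // false
}
```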
Understanding Character Classes, Groups, Quantifiers and Boundaries
Another problem appears. A superhero called catman enters Supervillains Club. How do you forbid both catman and batman?
With a standard string method, you can use the if condition with a logical operator. But you’ll use regex.
You want to check whether the string is batman or catman. Notice, only one character is different. The remaining characters, atman, are the same.
Using Character Classes
You can use a character class to group b
and c
. Replace your pattern
line with:
val pattern = Regex("[bc]atman", RegexOption.IGNORE_CASE)
The [ and ] create a character class. [bc] means either b or c. [aiueo] means any one vowel.
Some characters have a special meaning inside square brackets. If you want to negate the class, put ^ right after the opening bracket. [^aiueo] means any single character other than those vowels.
You can also use - to create a range of characters. [a-z] means any one lowercase letter from a to z.
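Here is a short sketch that exercises the three kinds of character class described above. The helper names firstNonVowel and lowercaseRuns are hypothetical, chosen just for this illustration.

```kotlin
// [bc]: exactly one character, either b or c.
val batOrCat = Regex("[bc]atman")

// [^aiueo]: the first character that is not one of these vowels.
fun firstNonVowel(text: String): String? = Regex("[^aiueo]").find(text)?.value

// [a-z]+: every run of lowercase ASCII letters.
fun lowercaseRuns(text: String): List<String> =
    Regex("[a-z]+").findAll(text).map { it.value }.toList()

fun main() {
    println(batOrCat.matches("catman")) // true
    println(batOrCat.matches("ratman")) // false
    println(firstNonVowel("audio"))     // d
    println(lowercaseRuns("Bat-Man 2")) // [at, an]
}
```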
Build and run the app. Try to input catman
. The validation works flawlessly.
Next, you’ll take a look at groups and quantifiers.
Using Groups and Quantifiers
All is well until batwoman breaks into Supervillains Club. Now you need to block batwoman as well as batman. Notice, the difference is the wo string: You can’t use a character class to solve this problem.
bat[wo]man means batwman or batoman. It doesn’t match batwoman.
What you want is a group.
Add this new rule to the existing regex syntax. Replace your pattern
line with:
val pattern = Regex("[bc]at(wo)?man", RegexOption.IGNORE_CASE)
Here, you use ( and ) to create a group. (wo) means a group of the wo string. Groups make several characters a single unit.
You want to make this group optional, so you apply ? after the group. The regex string is bat(wo)?man.
? is a quantifier. A quantifier defines how many occurrences of a unit to match. There are a few varieties of quantifiers in regex:
- ?: 0 or 1 occurrence.
- +: 1 or more occurrences.
- *: 0 or more occurrences.
You could use quantifiers to match occurrences of a unit:
- ba+ matches ba and baaaaa fully, but doesn’t match b.
- ba* matches b, ba and baaaaa fully.
- ba? matches b and ba fully, but only matches baaaa partially.
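The list above can be checked with a few lines of Kotlin. In this sketch, "fully" is tested with matches and "partially" with find; the variable names are invented for the example. The last pattern, batma+n, is an assumption about how you might catch the stretched-vowel names mentioned earlier, not something the starter project contains.

```kotlin
val oneOrMore = Regex("ba+")
val zeroOrMore = Regex("ba*")
val zeroOrOne = Regex("ba?")

// Assumed pattern for stretched vowels such as batmaaan.
val stretchedBatman = Regex("batma+n")

fun main() {
    println(oneOrMore.matches("baaaaa"))       // true
    println(oneOrMore.matches("b"))            // false
    println(zeroOrMore.matches("b"))           // true
    println(zeroOrOne.matches("baaaa"))        // false: no full match
    println(zeroOrOne.find("baaaa")?.value)    // ba: a partial match
    println(stretchedBatman.matches("batmaaan")) // true
}
```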
In your group, (wo)?, the syntax means the group on the left side of ? occurs either once or not at all.
That’s the purpose of the group: w and o in wo aren’t separable.
Build and run the app. Check to see that batwoman
and catwoman
can’t enter Supervillains Club.
What if you hadn’t used a group, so that the regex string was batwo?man instead?
[spoiler title=”Solution”]
That means the ? quantifier only applies to the o character. So batwo?man matches batwman and batwoman, but not batman.
You don’t want this. You want either batman or batwoman, but not batwman.
[/spoiler]
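A quick way to convince yourself is to run both patterns against the same three names. This is only a sketch; the variable names are made up for the comparison.

```kotlin
val withGroup = Regex("bat(wo)?man")  // ? applies to the whole group "wo"
val withoutGroup = Regex("batwo?man") // ? applies only to "o"

fun main() {
    for (name in listOf("batman", "batwoman", "batwman")) {
        println("$name: group=${withGroup.matches(name)}, no group=${withoutGroup.matches(name)}")
    }
}
```

With the group, batman and batwoman match and batwman doesn’t. Without it, batwoman and batwman match and batman doesn’t.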
Using Boundaries
You’re satisfied with your superb code: You protected Supervillains Club from superheroes. Then one day, a supervillain named I'm not Batman tries to register, and the validation stops the supervillain.
You get a complaint from your employer.
Now, you need to change the logic so that the regex matches batman, catman, batwoman and catwoman only if they appear at the beginning of the string.
Use a boundary to solve this problem. Add ^ at the front of the regex string by replacing your pattern line with:
val pattern = Regex("^[bc]at(wo)?man", RegexOption.IGNORE_CASE)
The ^ character here doesn’t have the same meaning as the ^ character inside the brackets. ^bat means bat at the beginning of the string. [^bat] means any single character other than b, a and t.
Build and run the app. Now I'm not Batman can register successfully in Supervillains Club.
The counterpart of ^ for the end of the string is $. So bat$ means bat at the end of the string.
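The following sketch shows both boundaries at work. blockedAtStart is the pattern from this section; endsWithBat is a hypothetical pattern added only to demonstrate $.

```kotlin
val blockedAtStart = Regex("^[bc]at(wo)?man", RegexOption.IGNORE_CASE)

// Hypothetical pattern used only to demonstrate the end boundary.
val endsWithBat = Regex("bat$")

fun main() {
    println(blockedAtStart.containsMatchIn("Batman"))         // true
    println(blockedAtStart.containsMatchIn("I'm not Batman")) // false
    println(endsWithBat.containsMatchIn("wombat"))            // true
    println(endsWithBat.containsMatchIn("batman"))            // false
}
```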
Regex Helper Tools from IntelliJ IDEA
Sometimes when writing your regex pattern, you want to check if it works as soon as possible, even without running your app. For this purpose, use regex helper tools from IntelliJ IDEA.
Move your caret to the regex pattern and press Alt-Enter on Linux/Windows or Option-Enter on Mac:
You have two helper tools dealing with regex. One edits the regex fragment, and the other checks the regex pattern.
Choose Check RegExp:
You have a form to validate an input string with your regex pattern. If the input string matches the regex pattern, you’ll see a green check mark.
If you put in an invalid input string:
You’ll see a red exclamation mark.
If you find that your regex pattern doesn’t work as expected, go back to your regex pattern. Press Alt-Enter or Option-Enter again:
Then choose Edit RegExp Fragment:
You’ll see a dedicated editor for your regex pattern where you can edit your regex pattern and get hints. For example, delete ) after the letter o:
You’ll see a warning about the missing ).
For this regex pattern, an editor is overkill. But it can be handy while editing a complex regex.
Understanding Predefined Classes and Groups
You were working on the signup page when you got a distress call: Some superheroes have infiltrated Supervillains Club, and you need to root them out!
Open http://localhost:8080/impostors and you’ll see some names:
Click Find Impostors and you’ll get… nothing. The clue is anyone with Captain in their name is a superhero. Based on that information, it’s time to write a new regex.
You’ll use findAll, another Regex method. This time you don’t want to check whether a regex matches a string. You want to pull out every substring that matches the regex.
In RegexValidator.kt, replace the content of filterNames
with:
val pattern = Regex("""Captain""")
return pattern.findAll(names).map {
it.value
}.toList()
The findAll method returns a sequence of MatchResult objects. To get the matched string, you use the value property of MatchResult.
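To see what findAll hands back, here is a minimal sketch. The describeMatches helper and the sample input are assumptions made for this example; the real input comes from the web form.

```kotlin
// Hypothetical helper: lists each match together with the index where it starts.
fun describeMatches(text: String): List<String> {
    // findAll is lazy: it returns a Sequence<MatchResult>, not a List.
    val matches: Sequence<MatchResult> = Regex("Captain").findAll(text)
    return matches.map { "${it.value} at ${it.range.first}" }.toList()
}

fun main() {
    println(describeMatches("Captain America, Captain Marvel"))
    // [Captain at 0, Captain at 17]
}
```

Besides value, each MatchResult also exposes range and groupValues, which become useful once the pattern contains groups.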
Build and run the app. Submit the form, and you’ll get this result:
Not good! The result contains only the word Captain, not the impostors’ full names. You could write the regex string like this: Captain (America|Marvel). It works for this case, but it’s not scalable.
What if there’s another impostor named Captain Saving the World or Captain Love? Then you’d need to rewrite your regex string.
There’s a better way. You can use predefined classes and the + quantifier.
Replace your Regex
with:
val pattern = Regex("""Captain\s\w+""")
\s and \w are predefined classes. \s means any whitespace character, like a space or a tab. \w means any word character.
Build and run the app. Click Find Impostors and you’ll get this result:
Bingo! You successfully rooted them out!
\w is the same as [a-zA-Z_0-9], so you could also write your regex in this more explicit form:
val pattern = Regex("""Captain\s[a-zA-Z_0-9]+""")
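Here is a small sketch of the predefined classes at work. The findCaptains helper and its sample inputs are assumptions for illustration only.

```kotlin
// Hypothetical helper: every "Captain <word>" found in the text.
fun findCaptains(text: String): List<String> =
    Regex("""Captain\s\w+""").findAll(text).map { it.value }.toList()

fun main() {
    println(findCaptains("Captain America, Magneto, Captain Marvel"))
    // [Captain America, Captain Marvel]

    // \w+ stops at the first non-word character, so only one word follows Captain.
    println(findCaptains("Captain Saving the World"))
    // [Captain Saving]
}
```

Note the second call: because \w doesn’t match a space, a multi-word name is cut off after its first word. Whether that matters depends on the names you expect.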
Captured Groups and Back-references
You captured all superheroes infiltrating Supervillains Club with the first name Captain. Now, they’re ready to convert to supervillains, but supervillains can’t use Captain as their name.
Your task now is to extract the last name from the superheroes. Later, Supervillains Club will give them a first name suitable for a supervillain.
To recap, you have to remove Captain from Captain Marvel, then give Marvel to your employer. Later, your employer will give them a different first name, like Dark Marvel. You only need to extract the last name.
Build and run the app. Open http://localhost:8080/extract then click Extract Names. Nothing happens:
To solve this problem, you’ll still use findAll
. But this time, you’ll use a group in the regex string.
In RegexValidator.kt, replace the content of extractNames
with:
val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
it.groupValues[1]
}.toList()
This code is almost the same as the previous code, but there are two differences:
- (\w+): The regex string now has a group.
- groupValues[1]: You use groupValues instead of value.
groupValues[1] refers to the (\w+) group in the regex string. Remember that (\w+) is the last name.
What is the number 1 in groupValues[1] exactly? It’s the index of the first group in the groupValues list.
You don’t use index 0 because it refers to the full match, such as Captain Marvel. But how big could groupValues be? It depends on the number of groups in the regex string.
Suppose you have three groups in the regex string:
val pattern = Regex("""((Cap)tain)\s(\w+)""")
If the input string is Captain Marvel:
- Index 0 refers to Captain Marvel.
- Index 1 refers to Captain.
- Index 2 refers to Cap.
- Index 3 refers to Marvel.
You number the groups by the position of their opening parenthesis, from left to right, so an outer group comes before the groups nested inside it. Index 0 always refers to the full match. Index 1 refers to ((Cap)tain).
Then you go inside that group to get index 2, which refers to (Cap). Then you move to the right, and index 3 refers to (\w+).
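You can verify the numbering with a few lines of code. The groupsOf helper is a hypothetical name used only in this sketch.

```kotlin
// Hypothetical helper: the groupValues of the first match, or null if nothing matches.
fun groupsOf(input: String): List<String>? =
    Regex("""((Cap)tain)\s(\w+)""").find(input)?.groupValues

fun main() {
    val groups = groupsOf("Captain Marvel")
    groups?.forEachIndexed { index, value -> println("Index $index: $value") }
    // Index 0: Captain Marvel
    // Index 1: Captain
    // Index 2: Cap
    // Index 3: Marvel
}
```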
Build and run the app. Then click Extract Names. You’ll get this result:
You’ve extracted the last name perfectly. Good job!
You feel proud of your code. It helps supervillains prosper in this wicked world.
But your employer doesn’t have time to pick a custom first name for each superhero willing to become a supervillain. They tell you to use a generic first name, Super Evil, and be done with it. So Captain Marvel will become Super Evil Marvel.
Open http://localhost:8080/replace and click Replace Names. Nothing happens:
It’s time to convert these superheroes to supervillains!
To replace strings with regex, you use… guess what? replace. :]
Replace the content of replaceNames in RegexValidator.kt with the code below:
val pattern = Regex("""Captain\s(\w+)""")
return pattern.replace(names, "Super Evil $1")
replace accepts two parameters. The first is the input string you want to match against your regex, """Captain\s(\w+)""".
The second is the replacement string. It’s Super Evil $1.
The $1 in Super Evil $1 is a special sequence. $1 is the same as groupValues[1] in the previous example. This is a back-reference.
So the back-reference makes a reference to the captured group. The captured group is (\w+) in Captain\s(\w+).
It’s roughly like writing the following, which builds each new name from the captured group:
val pattern = Regex("""Captain\s(\w+)""")
val results = pattern.findAll(names)
return results.map {
"Super Evil ${it.groupValues[1]}"
}.joinToString()
But it’s much less code!
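If you need more control than a back-reference gives you, Regex also has a replace overload that takes a lambda. The sketch below shows both forms side by side; the function names and the sample input are assumptions for this example.

```kotlin
val captainPattern = Regex("""Captain\s(\w+)""")

// Back-reference form: $1 is filled in with the first captured group.
fun renameWithBackReference(names: String): String =
    captainPattern.replace(names, "Super Evil $1")

// Lambda form: you build the replacement from the MatchResult yourself.
fun renameWithLambda(names: String): String =
    captainPattern.replace(names) { match -> "Super Evil ${match.groupValues[1]}" }

fun main() {
    val names = "Captain America, Captain Marvel"
    println(renameWithBackReference(names)) // Super Evil America, Super Evil Marvel
    println(renameWithLambda(names))        // Super Evil America, Super Evil Marvel
}
```

Both forms leave the text between matches untouched, so the separators in the input survive the replacement.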
Build and run the app. Click Replace Names. You’ll see all superheroes who want to repent got a new first name:
Now with these new names, the superheroes have become supervillains officially!
Understanding Greedy Quantifiers, Possessive Quantifiers and Reluctant Quantifiers
Supervillains Club throws you another task. All supervillains have diet plans. The nutritionist in Supervillains Club has made a plan tailored for supervillains.
Open http://localhost:8080/diet and you’ll see a diet plan for supervillains in HTML format:
The data scientists ask you to extract the diet plan from the HTML file. In other words, you want to extract an array of the meals from the HTML string: 5kg Unicorn Meat, 2L Lava, 2kg Meteorite.
You need to match strings between the li tags. The strings could be anything. How do you match strings that can be anything?
You use . to represent any character in regex. Any character means any character in the universe, with one exception.
. may or may not match line terminators, depending on the configuration of the regex. But you don’t need to worry about this in this tutorial.
You know the ?, * and + quantifiers. These are called greedy quantifiers. You’ll see why they’re greedy soon!
What happens if you join . and *? Together they match any string, including an empty one!
Interestingly, you can add the ? or + quantifiers to .*. The extra quantifier alters the behavior of .*. You’ll experiment with all of them.
Using Greedy Quantifiers
First, you’ll use the greedy quantifier, .*
.
In RegexValidator.kt, replace the content of the extractNamesFromHtml
with:
val pattern = Regex("""<li>(.*)</li>""")
val results = pattern.findAll(names)
return results.map {
it.groupValues[1]
}.toList()
Here, you use the method you used previously, findAll. The logic is simple: You use a group to capture the string between the li tags. Then you use groupValues when extracting the string.
Build and run the app, then submit the form. The result isn’t what you’d expect:
You got a one-item array, not a three-item array. The (.*) regex pattern swallowed every </li> string except the last one.
That’s why people call this quantifier greedy. It matches as much of the string as possible while still letting the full pattern match.
But there’s another quantifier that’s greedier than the greedy quantifier: the possessive quantifier.
Using Possessive Quantifiers
Now, replace the content of extractNamesFromHtml
with:
val pattern = Regex("""<li>(.*+)</li>""")
val results = pattern.findAll(names)
return results.map {
it.groupValues[1]
}.toList()
Notice that the difference is you put + on the right of .*. This is a possessive quantifier.
Build and run the app. Then submit the form:
The result is empty. The regex pattern failed to match the string because .*+ in <li>(.*+)</li> matches 5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li> and never gives any of it back. So by the time the regex engine moves on to </li> in <li>(.*+)</li>, there’s nothing left to match.
What you want is a reluctant quantifier.
Using Reluctant Quantifiers
Replace the content of extractNamesFromHtml
with:
val pattern = Regex("""<li>(.*?)</li>""")
val results = pattern.findAll(names)
return results.map {
it.groupValues[1]
}.toList()
Notice, the difference is you put ? on the right of .*. This is a reluctant quantifier.
Build and run the app. Then submit the form:
This is the correct result. The (.*?) matches as few characters as possible before </li>. The (.*?) reluctantly moves forward.
Now, you successfully extracted the meals data using regex.
To get more familiar with these quantifiers, check out this comparison between them and their results:
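One way to compare the three quantifiers yourself is to run them against the same input. This is a sketch under an assumption: the dietHtml string below is a made-up, single-line stand-in for the diet plan page, and captures is a hypothetical helper.

```kotlin
// Assumed sample input: the whole list on one line, so "." can run across all three items.
val dietHtml = "<ul><li>5kg Unicorn Meat</li><li>2L Lava</li><li>2kg Meteorite</li></ul>"

// Hypothetical helper: what the first capture group holds for every match.
fun captures(pattern: String, input: String): List<String> =
    Regex(pattern).findAll(input).map { it.groupValues[1] }.toList()

fun main() {
    // Greedy: one match that swallows the inner </li><li> pairs.
    println(captures("""<li>(.*)</li>""", dietHtml))
    // Possessive: no match, because .*+ never gives characters back.
    println(captures("""<li>(.*+)</li>""", dietHtml))
    // Reluctant: one match per list item.
    println(captures("""<li>(.*?)</li>""", dietHtml))
}
```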
Understanding the Logical Operator and Escaping Regex
Supervillains Club recruits a lot of young supervillains. They also monitor the chatting between young supervillains to ensure they don’t defect. But Gen Z writes differently: they don’t respect English grammar.
This creates a problem when Supervillains Club wants to analyze Gen Z’s dialog when chatting. A Gen Z supervillain might write: “I just beat a hero :] looks like I’m good :)”.
You have to separate the dialog into sentences, but Gen Z supervillains don’t use end punctuation. Kids these days… :]
Fortunately, Supervillains Club’s NLP scientists have done their research. It looks like Gen Z uses :], :) and 🤣 as a replacement for the period.
Open http://localhost:8080/split and submit the form:
Nothing happens. It’s time to analyze Gen Z using regex!
To split the sentences using regex, you use… split
!
In RegexValidator.kt, replace the content of splitSentences
with:
val escapedString = Regex.escape(""":)""")
val pattern = Regex("""(:]|${escapedString})|🤣""")
return pattern.split(sentences).map {
it.trim()
}
split uses the regex to find the separators in the input string and returns the pieces between them. If the regex string is Y and the input string is sunny Y rainy Y cloudy, then the result is sunny, rainy and cloudy, once trim has removed the surrounding spaces.
But you notice there’s another character, |. This is a special character in regex. It’s the logical OR operator.
If you want to split on more than one separator, join them using |. If the regex string is Y|B, then you’ll split the sentences using Y or B.
You’ll also see you escape :) using escape:
Regex.escape(""":)""")
The ) character is special in regex. As you learned previously, it’s the character you use to close a group.
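Here is a minimal, standalone sketch of the same idea. The splitChat function is a hypothetical name, and the extra filter step is an addition of this example: it drops the empty piece that split returns when the text ends with a separator.

```kotlin
// Hypothetical helper: splits a chat message on :], :) or 🤣.
fun splitChat(text: String): List<String> {
    // escape() wraps the text so that ")" is treated as a literal character.
    val smiley = Regex.escape(":)")
    val separators = Regex(""":]|$smiley|🤣""")
    return separators.split(text)
        .map { it.trim() }
        .filter { it.isNotEmpty() } // drop the empty piece after a trailing separator
}

fun main() {
    println(splitChat("I just beat a hero :] looks like I'm good :)"))
    // [I just beat a hero, looks like I'm good]
}
```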
Build and run the app. Submit the form again. You’ll see this:
Your work impressed Supervillains Club. They offer to make you a supervillain.
Why not?
Your supervillain name is Regex Monster. When people have a problem, you tell them a popular regex joke: “Now you have two problems.” :]
Where to Go From Here
Download the final project using the Download Materials button at the top or bottom of the tutorial.
You learned the most common Regex methods, but there are some you didn’t try, like replaceFirst, splitToSequence and toPattern. You can consult the Regex API documentation to learn more.
You also need to be careful with the catastrophic backtracking problem. If you write regex wrong, the regex could consume high CPU and create an outage.
You used some regex patterns, but regex pattern syntax is vast. For example, you haven’t used the multiline option or named groups. Head to the Regex pattern documentation to learn more about the regex pattern.
Regex isn’t invincible. It fails in fuzzy operations like classifying the sentiment of a tweet. For this problem, you need Natural Language Processing or NLP.
Regex is complicated. You can debug the regex pattern in many regex playgrounds. One example is regex101. Choose the Java 8 flavor in the playground.
I hope you enjoyed this tutorial! Please join the forum discussion below if you have any questions or comments.