Beginning Data Science with Jupyter Notebook and Kotlin

This tutorial introduces the concepts of data science using Jupyter Notebook and Kotlin. You’ll learn how to set up a Jupyter notebook, load krangl for Kotlin and use it for data science with built-in sample data. By Joey deVilla.


Summarizing

Summarizing means applying calculations to a grouped data frame, one group at a time. Next, calculate sleep statistics for the grouped data frame.

Run the following code in a new code cell:

groupedData
    .summarize(
        "Mean daily total sleep (hours)" to { it["sleep_total"].mean(removeNA = true) },
        "Mean daily REM sleep (hours)" to { it["sleep_rem"].mean(removeNA = true) }
    )

The output, as the summarize() method name suggests, is a nice summary:

A table summarizing the sleepData data frame by grouping animals by vore and listing their mean daily total sleep and mean daily REM sleep. Text at the bottom says that the data frame's shape is 5 by 3.
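
To make the per-group calculation concrete, here is a minimal sketch that uses only Kotlin's standard collections rather than krangl. The `Animal` class, the function names and the sample values are all hypothetical and invented for illustration. It mirrors the idea behind `mean(removeNA=true)`: missing values are dropped before the mean is taken.

```kotlin
// Hypothetical stand-in for a few rows of sleep data; the values are made up.
data class Animal(val name: String, val vore: String, val sleepRem: Double?)

// Mean of the values that are present, or null when nothing usable remains.
fun meanIgnoringMissing(values: List<Double?>): Double? {
    val present = values.filterNotNull()
    return if (present.isEmpty()) null else present.average()
}

// Group the animals by diet, then reduce each group to a single number.
fun meanRemByVore(animals: List<Animal>): Map<String, Double?> =
    animals
        .groupBy { it.vore }
        .mapValues { (_, group) -> meanIgnoringMissing(group.map { it.sleepRem }) }

fun main() {
    val animals = listOf(
        Animal("A", "herbi", 1.0),
        Animal("B", "herbi", null), // missing measurement, ignored by the mean
        Animal("C", "herbi", 2.0),
        Animal("D", "carni", 3.0)
    )
    println(meanRemByVore(animals)) // {herbi=1.5, carni=3.0}
}
```

Each group collapses to one value, which is why a summarized data frame has one row per group.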

Now, improve on the summary by sorting it.

Run the following in a new code cell:

groupedData
    .summarize(
        "Mean daily total sleep (hours)" to { it["sleep_total"].mean(removeNA = true) },
        "Mean daily REM sleep (hours)" to { it["sleep_rem"].mean(removeNA = true) }
    )
    .sortedBy("Mean daily total sleep (hours)")

Now the summary lists the groups sorted by how much sleep they get, from least to most:

A table summarizing the sleepData data frame by grouping animals by vore and listing their mean daily total sleep and mean daily REM sleep, listed in order of increasing sleep. Text at the bottom says that the data frame's shape is 5 by 3.

From this summary, you’ll see that herbivores sleep the least, carnivores and omnivores get a little more sleep, and insectivores get the most sleep, spending more time asleep than awake.

The summary might suggest hypotheses worth testing with more experiments. One of the more obvious ones is that herbivores are prey for carnivores and omnivores, which means that they have to stay alert and sleep less.

In data science, a common workflow applies the following operations to a data frame, in this order:

  • Filtering / Selecting
  • Grouping
  • Summarizing
  • Sorting
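
The four steps above can be sketched with plain Kotlin collections. This is an illustration of the order of operations, not krangl code: `SleepRecord`, `GroupSummary`, `summarizeSleep()`, the 100 kg cutoff and the sample values are all assumptions made up for this example.

```kotlin
// Hypothetical record type; the sample values below are invented.
data class SleepRecord(
    val name: String,
    val vore: String,
    val sleepTotal: Double,
    val bodyWeightKg: Double
)

// One summary row per group.
data class GroupSummary(val vore: String, val meanSleepTotal: Double)

fun summarizeSleep(records: List<SleepRecord>): List<GroupSummary> =
    records
        .filter { it.bodyWeightKg <= 100.0 }   // 1. Filter: keep smaller animals only
        .groupBy { it.vore }                   // 2. Group: one bucket per diet
        .map { (vore, group) ->                // 3. Summarize: one value per bucket
            GroupSummary(vore, group.map { it.sleepTotal }.average())
        }
        .sortedBy { it.meanSleepTotal }        // 4. Sort: least sleep first

val sampleRecords = listOf(
    SleepRecord("animal1", "carni", 12.0, 5.0),
    SleepRecord("animal2", "herbi", 8.0, 2.0),
    SleepRecord("animal3", "insecti", 20.0, 0.5),
    SleepRecord("animal4", "herbi", 10.0, 40.0),
    SleepRecord("animal5", "carni", 14.0, 80.0),
    SleepRecord("animal6", "herbi", 4.0, 600.0) // removed by the filter step
)

fun main() {
    summarizeSleep(sampleRecords).forEach { println(it) }
}
```

Filtering first keeps the later steps cheap, and sorting last means only the handful of summary rows get sorted instead of the full data set.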

Importing Data

While you can load data into a data frame using code, it’s quite unlikely that you’ll be doing it that way. In most cases, you’ll work with data saved in a commonly used file format.

Data entry is a big and often overlooked part of data science, and spreadsheets remain the preferred data entry tool, even after all these years. They make it easy to enter tables of data, and they’ve been around long enough for them to become a tool that even casual computer users understand.

While spreadsheet applications save their files in a proprietary format, they can also export their data in a couple of standard plain-text formats that other applications can easily read: .csv and .tsv.

Reading .csv Data

One of the most common file formats for data is .csv, which is short for comma-separated values.

Each line in a .csv file represents a row of data, and within each line, the column values are separated by commas. The first row contains column titles by default, while the remaining rows contain the data.

For example, here’s how the data frame you created earlier would be represented in .csv form:

language,developer,year_first_appeared,preferred
Kotlin,JetBrains,2011,true
Java,James Gosling,1995,false
Swift,Chris Lattner et al.,2014,true
Objective-C,Tom Love and Brad Cox,1984,false
Dart,Lars Bak and Kasper Lund,2011,true
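
To see how that text maps onto rows and columns, here is a deliberately naive sketch in plain Kotlin. The `parseSimpleCsv()` function is hypothetical and is not how krangl reads files. It assumes that the first line is the header and that no value contains a comma or a quote.

```kotlin
// Naive parser for illustration only: it splits on every comma,
// so it doesn't handle quoted fields or embedded commas.
fun parseSimpleCsv(text: String): List<Map<String, String>> {
    val lines = text.trim().lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val header = lines.first().split(",")
    // Pair each value with its column title to build one map per row.
    return lines.drop(1).map { line -> header.zip(line.split(",")).toMap() }
}

val languagesCsv = """
    language,developer,year_first_appeared,preferred
    Kotlin,JetBrains,2011,true
    Java,James Gosling,1995,false
""".trimIndent()

fun main() {
    val rows = parseSimpleCsv(languagesCsv)
    println(rows.size)            // 2
    println(rows[0]["developer"]) // JetBrains
}
```

Every value comes back as a string here. A real reader also has to work out which columns hold numbers or Booleans.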

Given a URL for a remote file, the readCSV() method of the DataFrame class reads .csv data and uses it to create a new data frame.

Enter and run the following in a new code cell:

val ramenRatings = DataFrame.readCSV("https://koenig-media.raywenderlich.com/uploads/2021/07/ramen-ratings.csv")
ramenRatings

You’ll see the following result:

A table displaying the first 6 rows of the ramenRatings data frame. Text at the bottom says there are 2574 more rows and that the data frame's shape is 2580 by 11.

You could’ve just as easily downloaded the file and read it locally using readCSV(), as it’s versatile enough to work with both URLs and local filepaths.

Reading .tsv Data

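The .csv format has one major limitation: since it uses commas as the separator, any value that contains a comma must be wrapped in quotes, and not every tool reads or writes quoted values consistently. This makes the format awkward for certain kinds of data, especially text data containing full sentences.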

This is where the .tsv format is useful. Rather than delimiting data with commas, the .tsv format uses tab characters, which are control characters that aren’t typically part of text created by humans.
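
A quick sketch in plain Kotlin shows the difference. The review text and the trailing `1` flag are hypothetical, and `splitFields()` is a helper invented for this example that performs a simple split with no quote handling.

```kotlin
// Hypothetical review text that contains commas.
val reviewText = "Great noodles, friendly staff, fair prices"

// Split a line on a single delimiter character, with no quote handling.
fun splitFields(line: String, delimiter: Char): List<String> = line.split(delimiter)

fun main() {
    val commaLine = "$reviewText,1"  // review and flag joined by a comma
    val tabLine = "$reviewText\t1"   // review and flag joined by a tab

    println(splitFields(commaLine, ',').size) // 4: the review is broken apart
    println(splitFields(tabLine, '\t').size)  // 2: the review stays in one piece
}
```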

The DataFrame class’s readTSV() method works like readCSV(), except that it initializes a data frame with the data from a .tsv file.

Run this code in a new code cell:

val restaurantReviews = DataFrame.readTSV("https://koenig-media.raywenderlich.com/uploads/2021/07/restaurant-reviews.tsv")
restaurantReviews

It should produce the following output:

A table displaying the first 6 rows of the restaurantReviews data frame. Text at the bottom says there are 994 more rows and that the data frame's shape is 1000 by 2.

Because tabs separate the columns, the review text is free to contain commas and other punctuation.

Where to Go From Here?

You can download the Jupyter Notebook files containing all the code from the exercises above by clicking on the Download Materials button at the top or bottom of the tutorial.

You’ve completed your first steps in data science with Kotlin. The data frame basics covered here are the foundation of many Jupyter Notebook projects, and they’re just the beginning.

There’s a lot more ground you can cover while exploring Kotlin-powered data science.

We hope you enjoyed this tutorial. If you have any questions or comments, please join the forum discussion below!