Create Your Own Kotlin Playground (and Get a Data Science Head Start) with Jupyter Notebook

Learn the basics of Jupyter Notebook and how to turn it into an interactive interpreter for Kotlin. You’ll also learn about Data Frames, an important data structure for data science applications. By Joey deVilla.


Getting the Data Frame’s Schema

In the world of databases, the term “schema” has a specific meaning: a description of how the data in a database is organized. For a DataFrame, the schema describes how the data frame’s data is organized and includes a small sample of that data. You can see a data frame’s schema with DataFrame’s schema() method.

Look at df’s schema. Run the following in a new code cell:

df.schema()

You’ll see the following output:

DataFrame with 5 observations
language             [Str]  Kotlin, Java, Swift, Objective-C, Dart
developer            [Str]  JetBrains, James Gosling, Chris Lattner et al., Tom Love and Brad Cox, Lars Bak and Kasper Lund
year_first_appeared  [Int]  2011, 1995, 2014, 1984, 2011
preferred            [Bol]  true, false, true, false, true

schema() is useful for getting a general idea about the data contained within a DataFrame. It prints the following:

  1. The number of rows in the data frame, which schema() refers to as “observations”.
  2. The name of each column in the data frame.
  3. The type of each column in the data frame.
  4. The first values stored in each column. Because df is a small data frame, schema() printed out all the values for all the columns.

You might remember that when you instantiated df, you never specified the column types. But schema() clearly shows each column has a type: language and developer are columns that contain string values, year_first_appeared contains integers, and preferred is a column of Booleans!

krangl’s dataFrameOf() function inferred the column types. You can specify column types when creating a DataFrame, but krangl uses the data you provide to determine the appropriate types so you don’t have to. This feature makes krangl feel more dynamically typed, like pandas and dplyr, and gives it a more Python- or R-like experience.
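To make the idea of inference more concrete, here is a minimal plain-Kotlin sketch of how a column type could be guessed from the values it holds. The function inferColumnLabel() and the labels it returns are hypothetical: they only mimic the tags that schema() prints, and this is not krangl’s actual implementation. To try it in a notebook cell, define the functions and then call main().

```kotlin
// Hypothetical helper: guesses a schema-style type label from a column's values.
// Illustration only; this is not how krangl implements type inference.
fun inferColumnLabel(values: List<Any?>): String {
    val nonNull = values.filterNotNull()
    return when {
        nonNull.isEmpty() -> "Any"               // nothing to infer from
        nonNull.all { it is Int } -> "Int"
        nonNull.all { it is Boolean } -> "Bol"
        nonNull.all { it is String } -> "Str"
        else -> "Any"                            // mixed types fall back to Any
    }
}

fun main() {
    println(inferColumnLabel(listOf("Kotlin", "Java", "Swift")))  // Str
    println(inferColumnLabel(listOf(2011, 1995, 2014)))           // Int
    println(inferColumnLabel(listOf(true, false, true)))          // Bol
    println(inferColumnLabel(listOf("Dart", 2011)))               // Any
}
```

The sketch falls back to a catch-all label when the values disagree, which is one simple way to handle a column that mixes types.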

Getting the Data Frame’s Dimensions and Column Names

schema() is good for diagnostics, but it isn’t useful if you want to programmatically find how many rows and columns are in a DataFrame or what its column names are. Fortunately, DataFrame has useful properties for this purpose:

  • nrow: The number of rows in the data frame.
  • ncol: The number of columns in the data frame.
  • names: A list of strings specifying the names of the columns, going from left to right.

Use these properties. Run the following in a new code cell:

println("The data frame has ${df.nrow} rows and ${df.ncol} columns.")
println("The column indices and names are:")
df.names.forEachIndexed { index, name ->
    println("$index: $name")
}

You’ll see this output:

The data frame has 5 rows and 4 columns.
The column indices and names are:
0: language
1: developer
2: year_first_appeared
3: preferred

Examining the Data Frame’s Columns

The cols property of DataFrame returns a list of objects representing each column, going from left to right. Use it to take a closer look at df’s columns.

Run the following in a new code cell:

df.cols.forEachIndexed { index, column ->
    println("$index: $column")
}

You’ll see this result:

0: language [Str][5]: Kotlin, Java, Swift, Objective-C, Dart
1: developer [Str][5]: JetBrains, James Gosling, Chris Lattner et al., Tom Love and Brad Cox, Lars Bak ...
2: year_first_appeared [Int][5]: 2011, 1995, 2014, 1984, 2011
3: preferred [Bol][5]: true, false, true, false, true

Each column object in the list returned by the cols property is an instance of the DataCol class. DataCol has properties and methods that let you examine a column in greater detail and even perform some analysis on its contents.

For now, stick to using two DataCol properties:

  • name: The name of the column.
  • length: The number of items or rows in the column.

Run the following in a new code cell:

df.cols.forEachIndexed { index, column ->
    println("$index: name: ${column.name}   length: ${column.length}")
}

It will produce the following output:

0: name: language   length: 5
1: name: developer   length: 5
2: name: year_first_appeared   length: 5
3: name: preferred   length: 5

DataFrame has some syntactic sugar that makes it easier to work with columns. Although you could access df’s first column using the syntax df.cols[0], it’s much simpler to access it using array syntax:

df[0] // Same thing as df.cols[0]

If you’d rather access a column by name, DataFrame also implements map syntax. For example, to access df’s first column, which is named language, you can use this code:

df["language"] // Column 0's name is language,
               // so this is equivalent to
               // df.cols[0] and df[0]
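Array and map syntax of this kind is available to any Kotlin class that declares operator functions named get. The sketch below uses a hypothetical MiniFrame class, a plain-Kotlin stand-in that is not part of krangl, to show how one class can support lookup both by position and by name.

```kotlin
// Hypothetical stand-in for a data frame: column names mapped to column values.
// Maps built with mapOf() keep insertion order, so column positions are stable.
class MiniFrame(private val columns: Map<String, List<Any?>>) {
    // Enables frame[0]: look up a column by position.
    operator fun get(index: Int): List<Any?> = columns.values.elementAt(index)

    // Enables frame["language"]: look up a column by name.
    operator fun get(name: String): List<Any?> =
        columns[name] ?: throw NoSuchElementException("No column named $name")
}

fun main() {
    val frame = MiniFrame(
        mapOf(
            "language" to listOf("Kotlin", "Java"),
            "year_first_appeared" to listOf(2011, 1995)
        )
    )
    println(frame[0])                       // [Kotlin, Java]
    println(frame["language"])              // [Kotlin, Java]
    println(frame[0] == frame["language"])  // true
}
```

Kotlin picks the overload from the argument type, so an Int subscript selects by position and a String subscript selects by name.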

Examining the Data Frame’s Rows

Just as DataFrame has a cols property to access its columns, it also has a rows property. It returns an Iterable that lets you access a collection object representing each row, going from top to bottom. Use it to take a closer look at df’s rows.

Run the following in a new code cell:

df.rows.forEachIndexed { index, row ->
    println("$index: $row")
}

You should see this output:

0: {language=Kotlin, developer=JetBrains, year_first_appeared=2011, preferred=true}
1: {language=Java, developer=James Gosling, year_first_appeared=1995, preferred=false}
2: {language=Swift, developer=Chris Lattner et al., year_first_appeared=2014, preferred=true}
3: {language=Objective-C, developer=Tom Love and Brad Cox, year_first_appeared=1984, preferred=false}
4: {language=Dart, developer=Lars Bak and Kasper Lund, year_first_appeared=2011, preferred=true}

Each row object is an instance of DataFrameRow, which is simply an alias for Map<String, Any?>, where each key-value pair represents the name of a column and its corresponding value. For example, you could modify the loop you just ran to print only each programming language and the year in which it first appeared using this code:

df.rows.forEachIndexed { index, row ->
    println("$index: name: ${row["language"]}   premiered: ${row["year_first_appeared"]}")
}

You’ll see this output:

0: name: Kotlin   premiered: 2011
1: name: Java   premiered: 1995
2: name: Swift   premiered: 2014
3: name: Objective-C   premiered: 1984
4: name: Dart   premiered: 2011
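Because a row’s values are typed as Any?, you need a cast before you can treat a value as a number or a string. The following plain-Kotlin sketch uses ordinary maps as a stand-in for two of df’s rows; sampleRows and yearsBetween() are hypothetical names chosen for this illustration.

```kotlin
// Plain-Kotlin stand-in for two rows of df; each row maps a column name to a value.
val sampleRows: List<Map<String, Any?>> = listOf(
    mapOf("language" to "Kotlin", "year_first_appeared" to 2011),
    mapOf("language" to "Java", "year_first_appeared" to 1995)
)

// Returns the number of years between two rows' debut years, or null if
// either value is missing or is not an Int.
fun yearsBetween(older: Map<String, Any?>, newer: Map<String, Any?>): Int? {
    val from = older["year_first_appeared"] as? Int ?: return null
    val to = newer["year_first_appeared"] as? Int ?: return null
    return to - from
}

fun main() {
    println(yearsBetween(sampleRows[1], sampleRows[0]))  // 16
}
```

The safe cast as? returns null instead of throwing when a value has an unexpected type, which suits data whose types are only known at runtime.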

Because rows returns an Iterable rather than a List, you need to use the elementAt() method to access a row by its index number. For example, the following code retrieves row 1 of df:

df.rows.elementAt(1) // Retrieve row 1
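The standard library offers a few related ways to index into an Iterable. This sketch uses a plain Iterable of maps, named rowSource here as a hypothetical stand-in for df.rows, to compare them.

```kotlin
// Stand-in for df.rows: typed as an Iterable of rows rather than a List.
val rowSource: Iterable<Map<String, Any?>> = listOf(
    mapOf("language" to "Kotlin"),
    mapOf("language" to "Java"),
    mapOf("language" to "Swift")
)

fun main() {
    // elementAt() works on any Iterable and throws if the index is out of range.
    println(rowSource.elementAt(1)["language"])  // Java

    // elementAtOrNull() returns null instead of throwing.
    println(rowSource.elementAtOrNull(10))       // null

    // Copying into a List once makes repeated index access convenient.
    val rowList = rowSource.toList()
    println(rowList[2]["language"])              // Swift
}
```

If you need many rows by index, converting to a List once is usually tidier than calling elementAt() repeatedly.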

Accessing Data Frame “Cells” by Column and Row

DataFrame provides a convenient column-row syntax for accessing individual “cells”.

Suppose you wanted to get the value in the year_first_appeared column for row 3. As mentioned before, you could access that column in several ways:

// These all produce the same result
df.cols[2]
df[2]
df["year_first_appeared"]

By adding a subscript to any of the lines above, you can access a specific row for that column. Here’s how you can access row 3 of the year_first_appeared column, which holds the value 1984:

// These all access the value in the "year_first_appeared" column
// of row 3
df.cols[2][3]
df[2][3]
df["year_first_appeared"][3]
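You can reach the same cell by going row first and then column. The sketch below models this in plain Kotlin with a map of column lists; columns, cell(), and rowAt() are hypothetical names for this illustration and are not krangl APIs.

```kotlin
// Stand-in for two of df's columns, stored column by column.
val columns: Map<String, List<Any?>> = mapOf(
    "language" to listOf("Kotlin", "Java", "Swift", "Objective-C", "Dart"),
    "year_first_appeared" to listOf(2011, 1995, 2014, 1984, 2011)
)

// Column-then-row lookup, mirroring the column-row subscript pattern.
fun cell(column: String, row: Int): Any? = columns.getValue(column)[row]

// Row-then-column lookup: build the row as a map first, then pick a column from it.
fun rowAt(row: Int): Map<String, Any?> = columns.mapValues { (_, values) -> values[row] }

fun main() {
    println(cell("year_first_appeared", 3))   // 1984
    println(rowAt(3)["year_first_appeared"])  // 1984
    println(rowAt(3))                         // {language=Objective-C, year_first_appeared=1984}
}
```

Both routes return the same value; the column-first route avoids building a whole row when you only need one cell.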