PySparkのデータ型について

2024 5/22

2024年5月22日 2024年5月22日

PySparkは、大規模データの処理と分析を効率的に行うための強力なツールであり、データフレームを操作する際に様々なデータ型をサポートしています。PySparkのデータ型を理解することは、データ処理と分析の効率を最大限に引き出すために非常に重要です。この記事では、PySparkで使用される主要なデータ型とその使用方法について詳しく解説します。

1. 基本的なデータ型

PySparkのデータフレームは、スキーマによって定義されたカラムとデータ型を持ちます。基本的なデータ型には以下のものがあります。

1.1 数値型

IntegerType: 整数型
LongType: 長整数型
FloatType: 単精度浮動小数点型
DoubleType: 倍精度浮動小数点型

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType, DoubleType

# SparkSessionの作成
spark = SparkSession.builder.appName("DataTypesExample").getOrCreate()

# サンプルデータの作成
data = [(1, 1.0, 1.0), (2, 2.5, 2.5), (3, 3.5, 3.5)]
columns = ["int_col", "float_col", "double_col"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("int_col", df["int_col"].cast(IntegerType()))
df = df.withColumn("float_col", df["float_col"].cast(FloatType()))
df = df.withColumn("double_col", df["double_col"].cast(DoubleType()))

df.printSchema()
df.show()

root
 |-- int_col: integer (nullable = true)
 |-- float_col: float (nullable = true)
 |-- double_col: double (nullable = true)

+-------+---------+----------+
|int_col|float_col|double_col|
+-------+---------+----------+
|      1|      1.0|       1.0|
|      2|      2.5|       2.5|
|      3|      3.5|       3.5|
+-------+---------+----------+

1.2 文字列型

StringType: 文字列型

from pyspark.sql.types import StringType

# サンプルデータの作成
data = [("Alice",), ("Bob",), ("Cathy",)]
columns = ["name"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("name", df["name"].cast(StringType()))

df.printSchema()
df.show()

root
 |-- name: string (nullable = true)

+-----+
| name|
+-----+
|Alice|
|  Bob|
|Cathy|
+-----+

1.3 ブール型

BooleanType: ブール型

from pyspark.sql.types import BooleanType

# サンプルデータの作成
data = [(True,), (False,), (True,)]
columns = ["bool_col"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("bool_col", df["bool_col"].cast(BooleanType()))

df.printSchema()
df.show()

root
 |-- bool_col: boolean (nullable = true)

+--------+
|bool_col|
+--------+
|    true|
|   false|
|    true|
+--------+

2. 日付とタイムスタンプ型

PySparkは、日付とタイムスタンプを扱うためのデータ型を提供しています。

DateType: 日付型
TimestampType: タイムスタンプ型

from pyspark.sql.types import DateType, TimestampType
from pyspark.sql.functions import current_date, current_timestamp

# サンプルデータの作成
data = [(1,), (2,), (3,)]
columns = ["id"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("current_date", current_date().cast(DateType()))
df = df.withColumn("current_timestamp", current_timestamp().cast(TimestampType()))

df.printSchema()
df.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- current_date: date (nullable = true)
 |-- current_timestamp: timestamp (nullable = true)

+---+------------+-----------------------+
|id |current_date|current_timestamp       |
+---+------------+-----------------------+
|1  |2023-01-01  |2023-01-01 12:00:00.000|
|2  |2023-01-01  |2023-01-01 12:00:00.000|
|3  |2023-01-01  |2023-01-01 12:00:00.000|
+---+------------+-----------------------+

3. 複合型

3.1 配列型

ArrayType: 配列型

from pyspark.sql.types import ArrayType

# サンプルデータの作成
data = [([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)]
columns = ["array_col"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("array_col", df["array_col"].cast(ArrayType(IntegerType())))

df.printSchema()
df.show(truncate=False)

root
 |-- array_col: array (nullable = true)
 |    |-- element: integer (containsNull = true)

+---------+
|array_col|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
+---------+

3.2 マップ型

MapType: マップ型

from pyspark.sql.types import MapType

# サンプルデータの作成
data = [({"key1": 1, "key2": 2},), ({"key1": 3, "key2": 4},), ({"key1": 5, "key2": 6},)]
columns = ["map_col"]

df = spark.createDataFrame(data, schema=columns)
df = df.withColumn("map_col", df["map_col"].cast(MapType(StringType(), IntegerType())))

df.printSchema()
df.show(truncate=False)

root
 |-- map_col: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

+---------------+
|map_col        |
+---------------+
|{key1 -> 1, key2 -> 2}|
|{key1 -> 3, key2 -> 4}|
|{key1 -> 5, key2 -> 6}|
+---------------+

3.3 構造体型

StructType: 構造体型

from pyspark.sql.types import StructType, StructField

# サンプルデータの作成
data = [(1, {"field1": "value1", "field2": 10}), (2, {"field1": "value2", "field2": 20})]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("struct_col", StructType([
        StructField("field1", StringType(), True),
        StructField("field2", IntegerType(), True)
    ]), True)
])

df = spark.createDataFrame(data, schema=schema)

df.printSchema()
df.show(truncate=False)

root
 |-- id: integer (nullable = true)
 |-- struct_col: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- field2: integer (nullable = true)

+---+-------------+
|id |struct_col   |
+---+-------------+
|1  |{value1, 10} |
|2  |{value2, 20} |
+---+-------------+

まとめ

PySparkは、多様なデータ型をサポートしており、これを活用することで効率的かつ柔軟なデータ処理が可能です。この記事では、基本的な数値型、文字列型、ブール型、日付・タイムスタンプ型、さらに複合型（配列型、マップ型、構造体型）について詳しく解説しました。これらのデータ型を適切に使用することで、データの正確な表現と効率的な処理を実現できます。