J'ai Spark DataFrame avec take (5) premières lignes comme suit:
[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]
Son schéma est défini comme:
elevDF.printSchema()
root
|-- date: timestamp (nullable = true)
|-- hour: long (nullable = true)
|-- value: double (nullable = true)
Comment puis-je obtenir les valeurs Année, Mois, Jour du champ "date"?
Depuis Spark 1.5, vous pouvez utiliser un certain nombre de fonctions de traitement des dates:
pyspark.sql.functions.year
pyspark.sql.functions.month
pyspark.sql.functions.dayofmonth
pyspark.sql.functions.dayofweek()
pyspark.sql.functions.dayofyear
pyspark.sql.functions.weekofyear()
import datetime
from pyspark.sql.functions import year, month, dayofmonth
elevDF = sc.parallelize([
(datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
(datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])
elevDF.select(
year("date").alias('year'),
month("date").alias('month'),
dayofmonth("date").alias('day')
).show()
# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# |1984| 1| 1|
# +----+-----+---+
Vous pouvez utiliser un simple map
comme avec n'importe quel autre RDD:
elevDF = sqlContext.createDataFrame(sc.parallelize([
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))
(elevDF
.map(lambda (date, hour, value): (date.year, date.month, date.day))
.collect())
et le résultat est:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
Btw: datetime.datetime
Stocke quand même une heure, donc la garder séparément semble être une perte de mémoire.
Vous pouvez utiliser des fonctions dans pyspark.sql.functions
: fonctions comme year
, month
, etc
reportez-vous ici: https://spark.Apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
from pyspark.sql.functions import *
newdf = elevDF.select(year(elevDF.date).alias('dt_year'), month(elevDF.date).alias('dt_month'), dayofmonth(elevDF.date).alias('dt_day'), dayofyear(elevDF.date).alias('dt_dayofy'), hour(elevDF.date).alias('dt_hour'), minute(elevDF.date).alias('dt_min'), weekofyear(elevDF.date).alias('dt_week_no'), unix_timestamp(elevDF.date).alias('dt_int'))
newdf.show()
+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no| dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497601|
| 2015| 9| 6| 249| 0| 0| 36|1441497603|
| 2015| 9| 6| 249| 0| 1| 36|1441497694|
| 2015| 9| 6| 249| 0| 20| 36|1441498808|
| 2015| 9| 6| 249| 0| 20| 36|1441498811|
| 2015| 9| 6| 249| 0| 20| 36|1441498815|