Connect to S3 data from PySpark

Question

I am trying to read a JSON file from Amazon S3, create a Spark context, and use it to process the data.

Spark runs inside a Docker container, so placing files on the container's filesystem is a PITA as well. That is why I pushed the file to S3.

The code below explains the rest.

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("first")
    sc = SparkContext(conf=conf)

    config_dict = {"fs.s3n.awsAccessKeyId": "**",
                   "fs.s3n.awsSecretAccessKey": "**"}

    bucket = "nonamecpp"
    prefix = "dataset.json"
    filename = "s3n://{}/{}".format(bucket, prefix)

    rdd = sc.hadoopFile(filename,
                        'org.apache.hadoop.mapred.TextInputFormat',
                        'org.apache.hadoop.io.Text',
                        'org.apache.hadoop.io.LongWritable',
                        conf=config_dict)

I get the following error:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-2-b94543fb0e8e> in <module>()
          9                     'org.apache.hadoop.io.Text',
         10                     'org.apache.hadoop.io.LongWritable',
    ---> 11                     conf=config_dict)
         12

    /usr/local/spark/python/pyspark/context.pyc in hadoopFile(self, path, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)
        558         jrdd = self._jvm.PythonRDD.hadoopFile(self._jsc, path, inputFormatClass, keyClass,
        559                                               valueClass, keyConverter, valueConverter,
    --> 560                                               jconf, batchSize)
        561         return RDD(jrdd, self)
        562

    /usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                 self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
        298                 raise Py4JJavaError(
        299                     'An error occurred while calling {0}{1}{2}.\n'.
    --> 300                     format(target_id, '.', name), value)
        301             else:
        302                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.hadoopFile.
    : java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:73)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at org.apache.hadoop.fs.s3native.$Proxy20.initialize(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:272)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
        at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:202)
        at org.apache.spark.api.python.PythonRDD$.hadoopFile(PythonRDD.scala:543)
        at org.apache.spark.api.python.PythonRDD.hadoopFile(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:744)

I have clearly provided fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey. What is going wrong?

Franzi · Accepted Answer

I solved this by adding --packages org.apache.hadoop:hadoop-aws:2.7.1 to the spark-submit command.

It downloads all the missing Hadoop packages that you need to run Spark jobs against S3.
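When the job is started from a plain Python process or a notebook instead of through spark-submit, the same --packages option can be handed to PySpark through the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, assuming the helper name build_submit_args (hypothetical, not part of PySpark) and assuming that version 2.7.1 matches the Hadoop version your Spark build was compiled against:

```python
import os

# Maven coordinate taken from the answer; adjust the version to your Hadoop build.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:2.7.1"


def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value that pulls in the given Maven packages.

    Hypothetical helper for illustration only.
    """
    # PySpark expects the value to end with "pyspark-shell".
    return "--packages {} pyspark-shell".format(",".join(packages))


# This must run before the first SparkContext is created in the process,
# because the JVM is launched with these arguments.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([HADOOP_AWS])
```

Setting the variable after a SparkContext already exists has no effect on that context, so this belongs at the very top of the script or notebook.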

Then, in your job, you need to set your AWS credentials like this:

    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)
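To show where these two calls fit in a job, here is a minimal sketch that wraps them in helpers and then reads the object with sc.textFile. The function names s3n_uri, configure_s3n and read_lines are hypothetical, and sc is assumed to be an already created SparkContext with hadoop-aws on its classpath:

```python
def s3n_uri(bucket, key):
    """Build an s3n:// URI of the same shape as the one in the question."""
    return "s3n://{}/{}".format(bucket, key)


def configure_s3n(sc, aws_id, aws_key):
    """Set the s3n credentials on the Hadoop configuration of an existing SparkContext."""
    # _jsc is a private attribute; it is used here exactly as in the answer.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", aws_id)
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", aws_key)
    return sc


def read_lines(sc, bucket, key):
    """Return an RDD with one element per line of the S3 object."""
    return sc.textFile(s3n_uri(bucket, key))
```

With the names from the question, usage would look like configure_s3n(sc, aws_id, aws_key) followed by read_lines(sc, "nonamecpp", "dataset.json").first().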

Another option is to define your credentials in spark/conf/spark-env:

    #!/usr/bin/env bash
    AWS_ACCESS_KEY_ID='xxxx'
    AWS_SECRET_ACCESS_KEY='xxxx'

    SPARK_WORKER_CORES=1 # to set the number of cores to use on this machine
    SPARK_WORKER_MEMORY=1g # to set how much total memory workers have to give executors (e.g. 1000m, 2g)
    SPARK_EXECUTOR_INSTANCES=10 # to set the number of worker processes per node

More info:

Amit Kushwaha · Answer

I suggest going through this link.

In my case, I used instance profile credentials to access the S3 data.

Instance profile credentials: used on EC2 instances and delivered through the Amazon EC2 metadata service. The AWS SDK for Java uses InstanceProfileCredentialsProvider to load these credentials.

Note

Instance profile credentials are used only if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set. See EC2ContainerCredentialsProviderWrapper for more information.
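The rule in this note can be checked from Python before starting Spark. The sketch below only mirrors the rule as stated in the note and is not the SDK's actual implementation; the function name credential_source and its return labels are hypothetical, and an empty value is treated as "not set" by assumption:

```python
import os

CONTAINER_URI_VAR = "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"


def credential_source(env=None):
    """Return which credential source the note's rule would select.

    Hypothetical helper: "container" when the variable is set,
    "instance-profile" otherwise.
    """
    env = os.environ if env is None else env
    # Assumption: an empty string counts as "not set".
    if env.get(CONTAINER_URI_VAR):
        return "container"
    return "instance-profile"
```

Calling credential_source() on the EC2 host or inside the container is a quick way to see why instance profile credentials might be skipped.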

For PySpark, I use the following settings to access S3 content.

    import pyspark

    def get_spark_context(app_name):
        # configure (the app name was accepted but unused in the original)
        conf = pyspark.SparkConf().setAppName(app_name)

        # init
        sc = pyspark.SparkContext.getOrCreate(conf=conf)

        # s3a config
        sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint',
                                          's3.eu-central-1.amazonaws.com')
        sc._jsc.hadoopConfiguration().set(
            'fs.s3a.aws.credentials.provider',
            'com.amazonaws.auth.InstanceProfileCredentialsProvider,'
            'com.amazonaws.auth.profile.ProfileCredentialsProvider'
        )

        # return a SQLContext wrapping the configured SparkContext
        return pyspark.SQLContext(sparkContext=sc)
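The SQLContext returned by get_spark_context can then load the JSON file from the question through the s3a scheme. A minimal sketch, assuming the hypothetical helper names s3a_uri and read_json and assuming the file holds one JSON object per line, which is what the reader expects by default:

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI for the given bucket and object key."""
    return "s3a://{}/{}".format(bucket, key)


def read_json(sql_context, bucket, key):
    """Load a JSON file from S3 into a DataFrame.

    sql_context is expected to be a pyspark SQLContext, such as the one
    returned by get_spark_context.
    """
    # SQLContext.read is a DataFrameReader; json() takes a path string.
    return sql_context.read.json(s3a_uri(bucket, key))
```

With the bucket and key from the question this would be called as read_json(get_spark_context("first"), "nonamecpp", "dataset.json").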

More on the Spark context here.

Please refer to this for the types of S3 access.