Python: Get URL path sections

Question

How do I get specific path sections from a URL? For example, I want a function that operates on this:

http://www.mydomain.com/hithere?image=2934

and returns "hithere"

or operates on this:

http://www.mydomain.com/hithere/something/else

and returns the same thing ("hithere")

I know this will probably use urllib or urllib2, but I can't figure out from the docs how to get only a section of the path.

Josh Lee · Accepted Answer

Extract the path component of the URL with urlparse (urllib.parse in Python 3):

    >>> import urlparse
    >>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
    >>> path
    '/hithere/something/else'

Split the path into components with os.path.split:

    >>> import os.path
    >>> os.path.split(path)
    ('/hithere/something', 'else')

The dirname and basename functions give you the two pieces of the split; perhaps use dirname in a while loop:

    >>> while os.path.dirname(path) != '/':
    ...     path = os.path.dirname(path)
    ...
    >>> path
    '/hithere'
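In Python 3 the same result can be sketched without the loop by splitting the path string directly. This is a different technique from the dirname loop above, not the answer's own method; the helper name first_path_section is hypothetical, and the sketch does not percent-decode the path.

```python
from urllib.parse import urlparse


def first_path_section(url):
    """Return the first non-empty path section of url, or None if there is none.

    Hypothetical helper for illustration only.
    """
    # urlparse separates the path from the query string and fragment.
    path = urlparse(url).path
    # Splitting on "/" yields empty strings for the leading slash and for
    # repeated slashes, so those are filtered out.
    segments = [segment for segment in path.split("/") if segment]
    return segments[0] if segments else None


first = first_path_section("http://www.mydomain.com/hithere?image=2934")
```

Returning None for a URL with no path is a design choice of this sketch; raising an exception would be equally reasonable.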

Iwan Aucamp · Answer

The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and operates consistently on POSIX paths whether it is used on POSIX or Windows NT platforms.

Sample code:

    #!/usr/bin/env python3

    import urllib.parse
    import sys
    import posixpath
    import ntpath
    import json

    def path_parse( path_string, *, normalize = True, module = posixpath ):
        result = []
        if normalize:
            tmp = module.normpath( path_string )
        else:
            tmp = path_string
        while tmp != "/":
            ( tmp, item ) = module.split( tmp )
            result.insert( 0, item )
        return result

    def dump_array( array ):
        string = "[ "
        for index, item in enumerate( array ):
            if index > 0:
                string += ", "
            string += "\"{}\"".format( item )
        string += " ]"
        return string

    def test_url( url, *, normalize = True, module = posixpath ):
        url_parsed = urllib.parse.urlparse( url )
        path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
            normalize=normalize, module=module )
        sys.stdout.write( "{}\n --[n={},m={}]-->\n {}\n".format(
            url, normalize, module.__name__, dump_array( path_parsed ) ) )

    test_url( "http://eg.com/hithere/something/else" )
    test_url( "http://eg.com/hithere/something/else/" )
    test_url( "http://eg.com/hithere/something/else/", normalize = False )
    test_url( "http://eg.com/hithere/../else" )
    test_url( "http://eg.com/hithere/../else", normalize = False )
    test_url( "http://eg.com/hithere/../../else" )
    test_url( "http://eg.com/hithere/../../else", normalize = False )
    test_url( "http://eg.com/hithere/something/./else" )
    test_url( "http://eg.com/hithere/something/./else", normalize = False )
    test_url( "http://eg.com/hithere/something/./else/./" )
    test_url( "http://eg.com/hithere/something/./else/./", normalize = False )

    test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
    test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
        module = ntpath )

Code output:

    http://eg.com/hithere/something/else
     --[n=True,m=posixpath]-->
     [ "hithere", "something", "else" ]
    http://eg.com/hithere/something/else/
     --[n=True,m=posixpath]-->
     [ "hithere", "something", "else" ]
    http://eg.com/hithere/something/else/
     --[n=False,m=posixpath]-->
     [ "hithere", "something", "else", "" ]
    http://eg.com/hithere/../else
     --[n=True,m=posixpath]-->
     [ "else" ]
    http://eg.com/hithere/../else
     --[n=False,m=posixpath]-->
     [ "hithere", "..", "else" ]
    http://eg.com/hithere/../../else
     --[n=True,m=posixpath]-->
     [ "else" ]
    http://eg.com/hithere/../../else
     --[n=False,m=posixpath]-->
     [ "hithere", "..", "..", "else" ]
    http://eg.com/hithere/something/./else
     --[n=True,m=posixpath]-->
     [ "hithere", "something", "else" ]
    http://eg.com/hithere/something/./else
     --[n=False,m=posixpath]-->
     [ "hithere", "something", ".", "else" ]
    http://eg.com/hithere/something/./else/./
     --[n=True,m=posixpath]-->
     [ "hithere", "something", "else" ]
    http://eg.com/hithere/something/./else/./
     --[n=False,m=posixpath]-->
     [ "hithere", "something", ".", "else", ".", "" ]
    http://eg.com/see%5C/if%5C/this%5C/works
     --[n=False,m=posixpath]-->
     [ "see\", "if\", "this\", "works" ]
    http://eg.com/see%5C/if%5C/this%5C/works
     --[n=False,m=ntpath]-->
     [ "see", "if", "this", "works" ]

Notes:

On Windows NT based platforms, os.path is ntpath.
On Unix/POSIX based platforms, os.path is posixpath.
ntpath will not handle backslashes (\) correctly (see the last two cases in the code/output), which is why posixpath is recommended.
Remember to use urllib.parse.unquote.
Consider using posixpath.normpath.
The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same).
Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
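The notes above can be illustrated with a few standalone calls. This is a small sketch, not part of the answer's script; the sample paths are made up for illustration. One edge case worth knowing: posixpath.normpath keeps exactly two leading slashes as they are, while collapsing three or more into one.

```python
import ntpath
import posixpath
from urllib.parse import unquote

# Percent-decoding: %5C is a backslash, %20 is a space.
decoded = unquote("/see%5C/if%20this/works")

# Adjacent separators inside the path are collapsed, and the trailing
# slash is dropped.
collapsed = posixpath.normpath("/hithere//something///else/")

# Exactly two leading slashes are preserved by posixpath.normpath.
double_leading = posixpath.normpath("//hithere/else")
triple_leading = posixpath.normpath("///hithere/else")

# posixpath treats a backslash as an ordinary character; ntpath treats
# it as a path separator, which is wrong for URL paths.
posix_split = posixpath.split("/see\\if")
nt_split = ntpath.split("/see\\if")

print(decoded, collapsed, double_leading, triple_leading)
print(posix_split, nt_split)
```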

Navin · Answer

Python 3.4+ solution:

    from urllib.parse import unquote, urlparse
    from pathlib import PurePosixPath

    url = 'http://www.example.com/hithere/something/else'

    PurePosixPath(unquote(urlparse(url).path)).parts[1]

    # returns 'hithere' (the same for the URL with parameters)

    # parts holds ('/', 'hithere', 'something', 'else')
    #               0    1          2            3
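Indexing parts[1] raises IndexError when the URL has no path sections. A minimal sketch of a wrapper that returns every section without the root entry is shown here; the name path_sections is hypothetical and not part of pathlib or urllib.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def path_sections(url):
    """Return the path sections of url as a tuple, without the root entry.

    Hypothetical helper for illustration only.
    """
    path = PurePosixPath(unquote(urlparse(url).path))
    # For an absolute path the first entry of parts is the anchor ("/"),
    # so it is skipped; an empty path has no anchor and no parts.
    return path.parts[1:] if path.anchor else path.parts


sections = path_sections("http://www.example.com/hithere/something/else")
first = sections[0] if sections else None
```

Because the path is decoded before it is split, an encoded slash (%2F) would be treated as a separator in this sketch.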

Aziz Alto · Answer

Note that in Python 3 the import has changed to from urllib.parse import urlparse. See the documentation. Here is an example:

    >>> from urllib.parse import urlparse
    >>> url = 's3://bucket.test/my/file/directory'
    >>> p = urlparse(url)
    >>> p
    ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
    >>> p.scheme
    's3'
    >>> p.netloc
    'bucket.test'
    >>> p.path
    '/my/file/directory'
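Applied to the first URL from the question, the same call shows that the query string never ends up in the path. This is a small sketch; parse_qs is only used here to show where the image parameter goes.

```python
from urllib.parse import parse_qs, urlparse

parsed = urlparse("http://www.mydomain.com/hithere?image=2934")

# The path stops at the "?", so the query string is not part of it.
path = parsed.path

# The query string is exposed separately and can be decoded with parse_qs,
# which maps each parameter name to a list of values.
query = parse_qs(parsed.query)

print(path, query)
```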

user6729158 · Answer

    import urlparse

    output = urlparse.urlparse('http://www.example.com/temp/something/happen/index.html').path
    # output is '/temp/something/happen/index.html'

    # Split the path with the built-in rpartition method of str
    output.rpartition('/')[0]
    # '/temp/something/happen'
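rpartition splits at the last slash, so it yields the parent path. To get the first section that the question asks for, the mirror-image method partition can be used after stripping the leading slash. This is a Python 3 sketch of that variation, not the answer's own code.

```python
from urllib.parse import urlparse

path = urlparse("http://www.example.com/temp/something/happen/index.html").path

# rpartition splits at the LAST "/": everything before it is the parent path.
parent = path.rpartition("/")[0]

# partition splits at the FIRST "/": after removing the leading slash,
# the part before it is the first path section.
first = path.lstrip("/").partition("/")[0]

print(parent, first)
```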

aliasav · Answer

A combination of urlparse and os.path.split will do the trick. The following script stores all sections of a URL in a list, in reverse order.

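    import os.path, urlparse

    def generate_sections_of_url(url):
        path = urlparse.urlparse(url).path
        sections = []
        # Peel off the last component until only the root (or nothing) is left;
        # the '' check avoids an endless loop when the URL has no path.
        while path not in ('/', ''):
            path, section = os.path.split(path)
            sections.append(section)
        return sections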

For http://www.mydomain.com/hithere/something/else this would return: ["else", "something", "hithere"]