Version plus rapide de la recherche de vecteurs triés (MATLAB)

Question

J'ai un code du genre suivant dans MATLAB:

indices = find([1 2 2 3 3 3 4 5 6 7 7] == 3)

Cela retourne 4,5,6 - les indices des éléments du tableau sont égaux à 3. Maintenant. mon code fait ce genre de chose avec de très longs vecteurs. Les vecteurs sont toujours triés .

Par conséquent, je voudrais une fonction qui remplace la complexité O(n) de la recherche avec O (log n), au détriment du fait que le tableau doit être trié.

Je suis conscient d’un membre, mais pour ce que je sais, il ne renvoie pas les index de tous les éléments, mais seulement le dernier (j’ai besoin de tous).

Pour des raisons de portabilité, la solution doit être exclusivement MATLAB (pas de fichiers mex compilés, etc.).

Daniel · Accepted Answer

Voici une implémentation rapide utilisant la recherche binaire. Ce fichier est également disponible sur github

function [b,c]=findInSorted(x,range) %findInSorted fast binary search replacement for ismember(A,B) for the %special case where the first input argument is sorted. % % [a,b] = findInSorted(x,s) returns the range which is equal to s. % r=a:b and r=find(x == s) produce the same result % % [a,b] = findInSorted(x,[from,to]) returns the range which is between from and to % r=a:b and r=find(x >= from & x <= to) return the same result % % For any sorted list x you can replace % [lia] = ismember(x,from:to) % with % [a,b] = findInSorted(x,[from,to]) % lia=a:b % % Examples: % % x = 1:99 % s = 42 % r1 = find(x == s) % [a,b] = myFind(x,s) % r2 = a:b % %r1 and r2 are equal % % See also FIND, ISMEMBER. % % Author Daniel Roeske <danielroeske.de> A=range(1); B=range(end); a=1; b=numel(x); c=1; d=numel(x); if A<=x(1) b=a; end if B>=x(end) c=d; end while (a+1<b) lw=(floor((a+b)/2)); if (x(lw)<A) a=lw; else b=lw; end end while (c+1<d) lw=(floor((c+d)/2)); if (x(lw)<=B) c=lw; else d=lw; end end end

DrGar · Answer

L'approche de Daniel est intelligente et sa fonction myFind2 est certainement rapide, mais des erreurs/bugs se produisent près des conditions aux limites ou dans le cas où les limites supérieure et inférieure produisent une plage en dehors de l'ensemble transmis.

En outre, comme il l'a noté dans son commentaire sur sa réponse, sa mise en œuvre présentait certaines inefficacités qui pourraient être améliorées. J'ai implémenté une version améliorée de son code, qui fonctionne plus rapidement, tout en gérant correctement les conditions aux limites. De plus, ce code inclut plus de commentaires pour expliquer ce qui se passe. J'espère que cela aide quelqu'un comme le code de Daniel m'a aidé ici!

function [lower_index,upper_index] = myFindDrGar(x,LowerBound,UpperBound) % fast O(log2(N)) computation of the range of indices of x that satify the % upper and lower bound values using the fact that the x vector is sorted % from low to high values. Computation is done via a binary search. % % Input: % % x- A vector of sorted values from low to high. % % LowerBound- Lower boundary on the values of x in the search % % UpperBound- Upper boundary on the values of x in the search % % Output: % % lower_index- The smallest index such that % LowerBound<=x(index)<=UpperBound % % upper_index- The largest index such that % LowerBound<=x(index)<=UpperBound if LowerBound>x(end) || UpperBound<x(1) || UpperBound<LowerBound % no indices satify bounding conditions lower_index = []; upper_index = []; return; end lower_index_a=1; lower_index_b=length(x); % x(lower_index_b) will always satisfy lowerbound upper_index_a=1; % x(upper_index_a) will always satisfy upperbound upper_index_b=length(x); % % The following loop increases _a and decreases _b until they differ % by at most 1. Because one of these index variables always satisfies the % appropriate bound, this means the loop will terminate with either % lower_index_a or lower_index_b having the minimum possible index that % satifies the lower bound, and either upper_index_a or upper_index_b % having the largest possible index that satisfies the upper bound. % while (lower_index_a+1<lower_index_b) || (upper_index_a+1<upper_index_b) lw=floor((lower_index_a+lower_index_b)/2); % split the upper index if x(lw) >= LowerBound lower_index_b=lw; % decrease lower_index_b (whose x value remains \geq to lower bound) else lower_index_a=lw; % increase lower_index_a (whose x value remains less than lower bound) if (lw>upper_index_a) && (lw<upper_index_b) upper_index_a=lw;% increase upper_index_a (whose x value remains less than lower bound and thus upper bound) end end up=ceil((upper_index_a+upper_index_b)/2);% split the lower index if x(up) <= UpperBound upper_index_a=up; % increase upper_index_a (whose x value remains \leq to upper bound) else upper_index_b=up; % decrease upper_index_b if (up<lower_index_b) && (up>lower_index_a) lower_index_b=up;%decrease lower_index_b (whose x value remains greater than upper bound and thus lower bound) end end end if x(lower_index_a)>=LowerBound lower_index = lower_index_b; else lower_index = lower_index_a; end if x(upper_index_b)<=UpperBound upper_index = upper_index_a; else upper_index = upper_index_b; end

Notez que la version améliorée de la fonction Daniels searchFor est désormais simplement:

function [lower_index,upper_index] = mySearchForDrGar(x,value) [lower_index,upper_index] = myFindDrGar(x,value,value);

chappjc · Answer

ismember vous donnera tous les index si vous regardez la première sortie:

>> x = [1 2 2 3 3 3 4 5 6 7 7]; >> [tf,loc]=ismember(x,3); >> inds = find(tf)

inds =

 4 5 6

Vous avez juste besoin d'utiliser le bon ordre des entrées.

Notez qu'il existe une fonction d'assistance utilisée par ismember que vous pouvez appeler directement:

% ISMEMBC - S must be sorted - Returns logical vector indicating which % elements of A occur in S tf = ismembc(x,3); inds = find(tf);

Utiliser ismembc permettra de gagner du temps de calcul puisque ismember appelle issorted en premier, mais le contrôle sera omis.

Notez que les versions plus récentes de matlab ont une fonction intégrée appelée par builtin('_ismemberoneoutput',a,b) avec les mêmes fonctionnalités.

Comme les applications ci-dessus de ismember, etc. sont quelque peu en arrière (recherche de chaque élément de x dans le deuxième argument plutôt que l'inverse), le code est beaucoup plus lent que nécessaire. Comme le fait remarquer le PO, il est regrettable que [~,loc]=ismember(3,x) ne fournisse que l'emplacement de la première occurrence de 3 dans x, plutôt que tout. Cependant, si vous avez une version récente de MATLAB (R2012b +, je pense), vous pouvez utiliser encore plus de fonctions intégrées non documentées pour obtenir le premier et le dernier index! Ce sont ismembc2 et builtin('_ismemberfirst',searchfor,x):

firstInd = builtin('_ismemberfirst',searchfor,x); % find first occurrence lastInd = ismembc2(searchfor,x); % find last occurrence % lastInd = ismembc2(searchfor,x(firstInd:end))+firstInd-1; % slower inds = firstInd:lastInd;

Encore plus lent que l'excellent code MATLAB de Daniel R., mais le voici (rntmX ajouté à la référence de randomatlabuser), juste pour le fun:

mean([rntm1 rntm2 rntm3 rntmX]) ans = 0.559204323050486 0.263756852283128 0.000017989974213 0.000153682125682

Voici les bits de documentation de ces fonctions à l'intérieur ismember.m:

% ISMEMBC2 - S must be sorted - Returns a vector of the locations of % the elements of A occurring in S. If multiple instances occur, % the last occurrence is returned % ISMEMBERFIRST(A,B) - B must be sorted - Returns a vector of the % locations of the elements of A occurring in B. If multiple % instances occur, the first occurence is returned.

Il y a en fait une référence à un ISMEMBERLAST intégré, mais cela ne semble pas exister (pour le moment?).

randomatlabuser · Answer

Ce n'est pas une réponse - je compare simplement la durée d'exécution des trois solutions suggérées par chappjc et Daniel R.

N = 5e7; % length of vector p = 0.99; % probability KK = 100; % number of instances rntm1 = zeros(KK, 1); % runtime with ismember rntm2 = zeros(KK, 1); % runtime with ismembc rntm3 = zeros(KK, 1); % runtime with Daniel's function for kk = 1:KK x = cumsum(Rand(N, 1) > p); searchfor = x(ceil(4*N/5)); tic [tf,loc]=ismember(x, searchfor); inds1 = find(tf); rntm1(kk) = toc; tic tf = ismembc(x, searchfor); inds2 = find(tf); rntm2(kk) = toc; tic a=1; b=numel(x); c=1; d=numel(x); while (a+1<b||c+1<d) lw=(floor((a+b)/2)); if (x(lw)<searchfor) a=lw; else b=lw; end lw=(floor((c+d)/2)); if (x(lw)<=searchfor) c=lw; else d=lw; end end inds3 = (b:c)'; rntm3(kk) = toc; end

La recherche binaire de Daniel est très rapide.

% Mean of running time mean([rntm1 rntm2 rntm3]) % 0.631132275892504 0.295233981447746 0.000400786666188 % Percentiles of running time prctile([rntm1 rntm2 rntm3], [0 25 50 75 100]) % 0.410663611685559 0.175298784336465 0.000012828868032 % 0.429120717937665 0.185935198821797 0.000014539383770 % 0.582281366154709 0.268931132925888 0.000019243302048 % 0.775917520641649 0.385297304740352 0.000026940622867 % 1.063753914942895 0.592429428396956 0.037773746662356

didac · Answer

J'avais besoin d'une fonction comme celle-ci. Merci pour le post @Daniel!

J'ai un peu travaillé avec elle car j'avais besoin de trouver plusieurs index dans le même tableau. Je voulais éviter les frais généraux de arrayfun (ou similaire) ou appeler la fonction plusieurs fois. Donc, vous pouvez passer un tas de valeurs dans range et vous obtiendrez les index dans le tableau.

function idx = findInSorted(x,range) % Author Dídac Rodríguez Arbonès (May 2018) % Based on Daniel Roeske's solution: % Daniel Roeske <danielroeske.de> % https://github.com/danielroeske/danielsmatlabtools/blob/master/matlab/data/findinsorted.m range = sort(range); idx = nan(size(range)); for i=1:numel(range) idx(i) = aux(x, range(i)); end end function b = aux(x, lim) a=1; b=numel(x); if lim<=x(1) b=a; end if lim>=x(end) a=b; end while (a+1<b) lw=(floor((a+b)/2)); if (x(lw)<lim) a=lw; else b=lw; end end end

Je suppose que vous pouvez utiliser plutôt parfor ou arrayfun. Je ne me suis pas testé à quelle taille de range cela paye, cependant.

Une autre amélioration possible consisterait à utiliser les index précédemment trouvés (si range est trié) pour réduire l'espace de recherche. Je suis sceptique quant à son potentiel d'économie de processeur en raison du runtime O(log n).

La fonction finale a fini par courir un peu plus vite. J'ai utilisé le framework de @randomatlabuser pour cela:

N = 5e6; % length of vector p = 0.99; % probability KK = 100; % number of instances rntm1 = zeros(KK, 1); % runtime with ismember rntm2 = zeros(KK, 1); % runtime with ismembc rntm3 = zeros(KK, 1); % runtime with Daniel's function for kk = 1:KK x = cumsum(Rand(N, 1) > p); searchfor = x(ceil(4*N/5)); tic range = sort(searchfor); idx = nan(size(range)); for i=1:numel(range) idx(i) = aux(x, range(i)); end rntm1(kk) = toc; tic a=1; b=numel(x); c=1; d=numel(x); while (a+1<b||c+1<d) lw=(floor((a+b)/2)); if (x(lw)<searchfor) a=lw; else b=lw; end lw=(floor((c+d)/2)); if (x(lw)<=searchfor) c=lw; else d=lw; end end inds3 = (b:c)'; rntm2(kk) = toc; end %% function b = aux(x, lim) a=1; b=numel(x); if lim<=x(1) b=a; end if lim>=x(end) a=b; end while (a+1<b) lw=(floor((a+b)/2)); if (x(lw)<lim) a=lw; else b=lw; end end end

Ce n'est pas une grosse amélioration, mais ça aide car j'ai besoin de plusieurs milliers de recherches.

% Mean of running time mean([rntm1 rntm2]) % 9.9624e-05 5.6303e-05 % Percentiles of running time prctile([rntm1 rntm2], [0 25 50 75 100]) % 3.0435e-05 1.0524e-05 % 3.4133e-05 1.2231e-05 % 3.7262e-05 1.3369e-05 % 3.9111e-05 1.4507e-05 % 0.0027426 0.0020301

J'espère que cela peut aider quelqu'un.

MODIFIER

S'il existe une chance significative d'obtenir des correspondances exactes, il est rentable d'utiliser la très rapide ismember avant d'appeler la fonction:

[found, idx] = ismember(range, x); idx(~found) = arrayfun(@(r) aux(x, r), range(~found));