How to convert a predicted sequence back to text in Keras?

Question

I have a sequence-to-sequence learning model that works well and is able to predict some outputs. The problem is that I don't know how to convert the output back into a text sequence.

This is my code.

    from keras.preprocessing.text import Tokenizer, base_filter
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras.layers import Dense

    txt1 = """What makes this problem difficult is that the sequences can vary in length, be comprised of a very large vocabulary of input symbols and may require the model to learn the long term context or dependencies between symbols in the input sequence."""

    # txt1 is used for fitting
    tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
    tk.fit_on_texts(txt1)

    # convert text to sequence
    t = tk.texts_to_sequences(txt1)

    # padding to feed the sequence to keras model
    t = pad_sequences(t, maxlen=10)

    model = Sequential()
    model.add(Dense(10, input_dim=10))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # predicting new sequence
    pred = model.predict(t)

    # Convert predicted sequence to text
    pred = ??

Ben Usman · Answer

Here is a solution that I found:

    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

Esben Eickhardt · Answer

I had to solve the same problem, so here is how I ended up doing it (inspired by @Ben Usman's reversed dictionary).

    # Importing library
    from keras.preprocessing.text import Tokenizer

    # My texts
    texts = ['These are two crazy sentences',
             'that I want to convert back and forth']

    # Creating a tokenizer
    tokenizer = Tokenizer(lower=True)

    # Building word indices
    tokenizer.fit_on_texts(texts)

    # Tokenizing sentences
    sentences = tokenizer.texts_to_sequences(texts)

    >sentences
    >[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12, 13]]

    # Creating a reverse dictionary
    reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

    # Function takes a tokenized sentence and returns the words
    def sequence_to_text(list_of_indices):
        # Looking up words in dictionary
        words = [reverse_word_map.get(letter) for letter in list_of_indices]
        return(words)

    # Creating texts
    my_texts = list(map(sequence_to_text, sentences))

    >my_texts
    >[['these', 'are', 'two', 'crazy', 'sentences'], ['that', 'i', 'want', 'to', 'convert', 'back', 'and', 'forth']]
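One practical detail when the sequences have gone through pad_sequences: padded positions hold 0, and 0 is never a key in the tokenizer's word index, so a reverse-dictionary lookup returns None for them. Below is a minimal sketch of a padding-aware variant. It is kept free of Keras so it runs on its own; the hand-written `word_index` dict is a hypothetical stand-in for `tokenizer.word_index`, and the `padding_index` parameter and the `padded` example are illustrative assumptions, not part of the answer above.

```python
# Hypothetical stand-in for tokenizer.word_index (Keras assigns ids starting at 1).
word_index = {"these": 1, "are": 2, "two": 3, "crazy": 4, "sentences": 5}

# Invert the mapping: integer id -> word.
reverse_word_map = {index: word for word, index in word_index.items()}


def sequence_to_text(list_of_indices, padding_index=0):
    """Map integer ids back to words, dropping padding and unknown ids."""
    words = []
    for index in list_of_indices:
        if index == padding_index:
            continue  # skip positions that were added by padding
        word = reverse_word_map.get(index)
        if word is not None:
            words.append(word)
    return " ".join(words)


# A sequence left-padded with zeros to length 8 (pad_sequences pads at the front by default).
padded = [0, 0, 0, 1, 2, 3, 4, 5]
decoded = sequence_to_text(padded)
print(decoded)
```

Joining with a space returns a single string per sentence; returning the `words` list instead would match the list-of-words output shown above.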

Jairo Alves · Answer

You can directly use the inverse function tokenizer.sequences_to_texts.

    text = tokenizer.sequences_to_texts(<list of the integer equivalent encodings>)

I have tested the above and it works as expected.

PS: Take extra care that the argument is the list of integer encodings and not the one-hot ones.
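To illustrate the difference between integer and one-hot encodings, here is a minimal sketch of reducing per-time-step scores to integer ids before decoding. It is written in plain Python so it runs without Keras; `word_index`, `predicted_scores` and `to_integer_ids` are hypothetical names invented for this example, and the scores are made-up values, not real model output.

```python
# Hypothetical stand-in for tokenizer.word_index; id 0 is reserved for padding.
word_index = {"that": 1, "i": 2, "want": 3}
index_word = {index: word for word, index in word_index.items()}

# Hypothetical model output for one sentence: one row of class scores per
# time step, with len(word_index) + 1 columns (column 0 is the padding id).
predicted_scores = [
    [0.05, 0.80, 0.10, 0.05],  # highest score at id 1
    [0.10, 0.10, 0.70, 0.10],  # highest score at id 2
    [0.05, 0.05, 0.10, 0.80],  # highest score at id 3
]


def to_integer_ids(score_rows):
    """Return the position of the largest score in each row (an argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]


integer_ids = to_integer_ids(predicted_scores)

# Decode the ids, ignoring any id that has no word (such as padding).
text = " ".join(index_word[i] for i in integer_ids if i in index_word)
print(integer_ids, text)
```

With a NumPy array returned by model.predict, `pred.argmax(axis=-1).tolist()` should yield the same kind of integer ids, which can then be passed to tokenizer.sequences_to_texts as a list of sequences.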

titipata · Answer

You can build a dictionary that maps each index back to its character.

    index_word = {v: k for k, v in tk.word_index.items()}  # map back
    seqs = tk.texts_to_sequences(txt1)
    words = []
    for seq in seqs:
        if len(seq):
            words.append(index_word.get(seq[0]))
        else:
            words.append(' ')
    print(''.join(words))  # output

    >>> 'what makes this problem difficult is that the sequences can vary in length
    >>> be comprised of a very large vocabulary of input symbols and may require the model
    >>> to learn the long term context or dependencies between symbols in the input sequence '

However, in the question you are trying to use a sequence of characters to predict an output of 10 classes, which is not a sequence-to-sequence model. In this case, you cannot simply turn the prediction (or pred.argmax(axis=1)) back into a sequence of characters.