Stateful LSTM and stream predictions

























I've trained an LSTM model (built with Keras and TF) on multiple batches of 7 samples with 3 features each, shaped like the sample below (the numbers are just placeholders for the purpose of explanation); each batch is labeled 0 or 1:



Data:



[
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
...
]


i.e. batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so each batch has shape (m, 7, 3)).



Target:



[
[1]
[0]
[1]
...
]


In my production environment, the data is a stream of samples with 3 features ([1,2,3],[1,2,3]...). I would like to feed each sample to my model as it arrives and get the intermediate probability without waiting for the entire batch of 7 - see the animation below.



[animation: samples arriving one by one, each producing an intermediate prediction]



One of my thoughts was to pad the batch with zeros for the missing samples, e.g.
[[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[0,0,0],[1,2,3]], but that seems inefficient.
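For illustration only, the padding workaround would look roughly like this (just a sketch; model is the trained Keras model shown in the update below):

import numpy as np

received = [[1, 2, 3]]                    # samples that have arrived so far (fewer than 7)
padded = np.zeros((1, 7, 3))              # one zero-padded batch of 7 timesteps
padded[0, -len(received):] = received     # place the real samples at the end
prob = model.predict(padded)              # the whole padded batch is re-processed for every new sample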



I would appreciate any help pointing me in the right direction, both for saving the LSTM intermediate state in a persistent way while waiting for the next sample, and for predicting with partial data on a model that was trained on a specific batch size.




Update, including model code:



from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, LeakyReLU, Dropout, Flatten, Dense
import keras_metrics

opt = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=10e-8, decay=0.001)
model = Sequential()

num_features = data.shape[2]
num_samples = data.shape[1]

first_lstm = LSTM(32, batch_input_shape=(None, num_samples, num_features), return_sequences=True, activation='tanh')
model.add(first_lstm)
model.add(LeakyReLU())
model.add(Dropout(0.2))
model.add(LSTM(16, return_sequences=True, activation='tanh'))
model.add(Dropout(0.2))
model.add(LeakyReLU())
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=opt,
              metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])  # f1 is a user-defined metric (not shown)


Model Summary:



_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 100, 32) 6272
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 100, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 100, 16) 3136
_________________________________________________________________
dropout_2 (Dropout) (None, 100, 16) 0
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 100, 16) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1600) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 1601
=================================================================
Total params: 11,009
Trainable params: 11,009
Non-trainable params: 0
_________________________________________________________________









Comments:

  • @ShlomiSchwartz I added a note at the beginning of my answer. Please read that to make sure you are on the right track. – today, Nov 15 '18 at 8:20

  • @ShlomiSchwartz This might help to understand the difference. – today, Nov 15 '18 at 8:30

  • @ShlomiSchwartz It does not matter whether your training model is stateful or not. You can use the solutions involving setting stateful=True, by copying the weights of the training model to the inference model and enabling statefulness there, as @DanielMöller's answer does. – today, Nov 15 '18 at 16:52

  • Hello @ShlomiSchwartz, I updated my answer considering your model architecture, please take a look. – Daniel Möller, Nov 16 '18 at 11:39

  • If the batches are parts of a longer sequence, your model should be stateful, or you should be using batches containing "entire" sequences. Notice that in all cases, the models will NOT see any relation between sequences in the SAME batch. Stateful models connect one batch to another batch. I suggest a detailed look at my answer in the link to understand exactly how Keras interprets your data: stackoverflow.com/questions/38714959/understanding-keras-lstms/… – Daniel Möller, Nov 17 '18 at 1:01















Tags: python, tensorflow, keras, lstm, stateful






asked Nov 7 '18 at 13:17 by Shlomi Schwartz, edited Nov 15 '18 at 8:24







4 Answers






I think there might be an easier solution.



If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True



Warning: your model has layers that act on the length dimension !!



The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.



So, before applying my answer below, fix your model so that it does not use the Flatten layer. Instead, simply remove return_sequences=True from the last LSTM layer (its output is then already 2D and no Flatten is needed).

The following code fixes that and also prepares a few things to be used with the answer below:



def createModel(forTraining):

    #model for training: stateful=False, any batch size, fixed 7-step sequences
    if forTraining == True:
        batchSize = None
        steps = num_samples
        stateful = False

    #model for predicting: stateful=True, fixed batch size, flexible number of steps
    #(steps=None so that single timesteps can be fed one at a time)
    else:
        batchSize = 1
        steps = None
        stateful = True

    model = Sequential()

    first_lstm = LSTM(32,
                      batch_input_shape=(batchSize, steps, num_features),
                      return_sequences=True, activation='tanh',
                      stateful=stateful)

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))

    #this is the last LSTM layer, use return_sequences=False
    model.add(LSTM(16, return_sequences=False, stateful=stateful, activation='tanh'))

    model.add(Dropout(0.2))
    model.add(LeakyReLU())

    #don't add a Flatten!!!
    #model.add(Flatten())

    model.add(Dense(1, activation='sigmoid'))

    if forTraining == True:
        compileThisModel(model)   #placeholder for the compile step shown in the question

    return model


With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.



The usage of a stateful model as a solution for your question



First, train this new model again, because it has no Flatten layer:



trainingModel = createModel(forTraining=True)
trainThisModel(trainingModel)


Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but with stateful=True in all its LSTM layers, and copy the weights from the trained model.



Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.



predictingModel = createModel(forTraining=False)
predictingModel.set_weights(trainingModel.get_weights())


And voilà. Just predict the outputs of the model with a single step:



#pseudo for-loop as samples arrive at your model:
prob = predictingModel.predict_on_batch(sample)

#where sample.shape == (1, 1, 3)


When you decide that you reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model treating it as a continuation of the previous one.
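Putting it together, a minimal streaming loop could look like this (just a sketch, not part of the original code; incoming_stream() is a hypothetical source of 3-feature samples, and predictingModel is the stateful model built above):

import numpy as np

for sample in incoming_stream():                              # hypothetical generator of 3-feature samples
    x = np.array(sample, dtype='float32').reshape(1, 1, 3)    # (batch=1, steps=1, features=3)
    prob = predictingModel.predict_on_batch(x)[0, 0]          # intermediate probability after this sample
    # ...use prob...
predictingModel.reset_states()                                # call this when the current sequence ends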




Saving and loading states



Just get and set them, saving with h5py:



def saveStates(model, saveName):

    f = h5py.File(saveName, 'w')

    for l, lay in enumerate(model.layers):
        #if you have nested models,
        #consider making this recurrent, testing for layers in layers
        if isinstance(lay, RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s),
                                 data=K.eval(stat),
                                 dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()


Working test for saving/loading states



import h5py, numpy as np
from keras.layers import RNN, LSTM, Dense, Input
from keras.models import Model
import keras.backend as K


def createModel():
    inp = Input(batch_shape=(1, None, 3))
    out = LSTM(5, return_sequences=True, stateful=True)(inp)
    out = LSTM(2, stateful=True)(out)
    out = Dense(1)(out)
    model = Model(inp, out)
    return model


def saveStates(model, saveName):

    f = h5py.File(saveName, 'w')

    for l, lay in enumerate(model.layers):
        #if you have nested models, consider making this recurrent, testing for layers in layers
        if isinstance(lay, RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()


def printStates(model):

    for l in model.layers:
        #if you have nested models, consider making this recurrent, testing for layers in layers
        if isinstance(l, RNN):
            for s in l.states:
                print(K.eval(s))


model1 = createModel()
model2 = createModel()
model1.predict_on_batch(np.ones((1, 5, 3)))  #changes model1's states

print('model1')
printStates(model1)
print('model2')
printStates(model2)

saveStates(model1, 'testStates5')
loadStates(model2, 'testStates5')

print('model1')
printStates(model1)
print('model2')
printStates(model2)


Considerations on the aspects of the data



In your first model (with stateful=False), each of the m sequences is treated as independent of the others, and each batch is assumed to contain unique sequences.

If this is not the case, you might want to train the stateful model instead (considering that each sequence is actually connected to the previous sequence). Then you would need m batches of 1 sequence each -> m x (1, 7 or None, 3). A training sketch follows.
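A minimal sketch of that stateful training pattern (my own illustration, not part of the original answer; data and targets are assumed to have shapes (m, 7, 3) and (m, 1), and n_epochs is an arbitrary hyperparameter):

statefulModel = createModel(forTraining=False)     # stateful=True, batch size 1, as defined above
statefulModel.compile(loss='binary_crossentropy', optimizer='adam')

for epoch in range(n_epochs):
    for i in range(len(data)):                     # m batches of one sequence each -> (1, 7, 3)
        statefulModel.train_on_batch(data[i:i+1], targets[i:i+1])
    statefulModel.reset_states()                   # states carry over between batches; reset once per pass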






Comments:

  • I thought reset_states() would mess the trained parameters, but I can confirm it doesn't, so this in combination with predict_on_batch (which overcomes the problem of having to specify a batch_size for stateful=True) makes it a very compact and elegant solution that I will definitely use myself! The downside of less verbosity is that it doesn't address the requirement of saving the LSTM intermediate state in a persistent way. – fr_andres, Nov 14 '18 at 18:11

  • @fr_andres, this now saves and loads states. – Daniel Möller, Nov 14 '18 at 19:56

  • @DanielMöller I have a problem with the stateful solutions mentioned here. I would appreciate it if you could tell me where I am wrong: as far as I know, a stateful LSTM layer means that one sample in a batch is the successor of the corresponding sample in the previous batch (from the Keras docs: if x1 and x2 are successive batches of samples, then x2[i] is the follow-up sequence to x1[i], for every i). That's why you correctly mentioned that batch sizes must be the same. However, the OP does not mention that the >>>> – today, Nov 15 '18 at 8:09

  • @DanielMöller >>> statefulness assumption holds for the training data and in the training phase. So setting stateful=True from the beginning (i.e. in the training model) may completely destroy the learning process. And if you don't set the stateful=True argument in the training model, you cannot modify it in its replica model for the inference phase. What am I missing? – today, Nov 15 '18 at 8:12

  • @DanielMöller I think you can just set the weights in your inference model as you did in your answer and enable statefulness in that only. I got my answer. Thanks. – today, Nov 15 '18 at 16:47


















If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so each batch has shape (m, 7, 3)). In any Keras RNN you can set the return_sequences flag to True to obtain the intermediate states, i.e., for every batch, instead of the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.

But you would be getting them all at once, at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the throughput while the batch is being processed. This may be even more constrained if you are using any of the cuDNN-optimized variants. What you can do is basically regard your batch as 7 successive batches of shape (m, 1, 3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set stateful to True and let the object keep track of it.




The following Python2+Keras example should exactly represent what you want. Specifically:



  • allowing to save the whole LSTM intermediate state in a persistent way

  • while waiting for the next sample

  • and predicting on a model trained on a specific batch size that may be arbitrary and unknown.

For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and about which you don't know much. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me on the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.



from __future__ import print_function
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
import numpy as np

# globals
SEQ_LEN = 7
HID_DIMS = 32
OUTPUT_DIMS = 3  # output dimensionality


##############################################################################
# define the model to be trained on a fixed batch size:
# assume many-to-one training setup (otherwise set return_sequences=True)
TRAIN_BATCH_SIZE = 20

x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
dense = Dense(OUTPUT_DIMS, activation='linear')
m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
m_train.summary()

# a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
targets = np.repeat([[123, 234, 345]], TRAIN_BATCH_SIZE, axis=0)  # dummy [[1,2,3],,,] -> [123,234,345] mapping to be learned


# train the model on a fixed batch size and save it
print(">> INFERENCE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
m_train.save("trained_lstm.h5")
print(">> INFERENCE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))


##############################################################################
# Now, although we aren't training anymore, we want to do step-wise predictions
# that do alter the inner state of the model, and keep track of that.

m_trained = load_model("trained_lstm.h5")
print(">> INFERENCE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))

# now define an analogous model that allows a flexible batch size for inference:
x_in = Input(shape=[SEQ_LEN, 3])
h_in = Input(shape=[HID_DIMS])
c_in = Input(shape=[HID_DIMS])
pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h, cc, c])

# Let's confirm that this model is able to load the trained parameters:
# first, check that the performance from scratch is not good:
print(">> INFERENCE BEFORE SWAPPING MODEL:")
predictions, hs, zs, cs = m_inference.predict([batch123,
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                              batch_size=1)
print(predictions)


# import the weights from the trained model and check that it works:
print(">> INFERENCE AFTER SWAPPING MODEL:")
for layer in m_trained.layers:
    if "lstm" in layer.name:
        m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
        m_inference.get_layer("dense_infer").set_weights(layer.get_weights())

predictions, _, _, _ = m_inference.predict([batch123,
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                           batch_size=1)
print(predictions)


# finally perform granular predictions while keeping the recurrent activations. Starting the
# sequence with zeros is a common practice, but depending on how you trained, you might have an
# <END_OF_SEQUENCE> character that you might want to propagate instead:
h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # h, cc, c hold the network's throughput: h is the proper LSTM output, c is the accumulator and cc is (probably) the candidate
    current_input = batch123[i:i+1]  # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press any key to continue")



Additional information:



Since we have two forms of state persistency:

1. The saved/trained parameters of the model that are the same for each sequence

2. The a, c states that evolve throughout the sequences and may be "restarted"



It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c states (h and c in the code) are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what they mean. They can be inspected as follows:



for w in lstm.weights:
    print(w.name, w.shape)


In our case (32 hidden units), this returns the following:



lstm_1/kernel:0 (3, 128)
lstm_1/recurrent_kernel:0 (32, 128)
lstm_1/bias:0 (128,)


We observe a dimensionality of 128. Why is that? This link describes the Keras LSTM implementation as follows:



[image: the Keras LSTM gate equations from the linked description]




The g is the recurrent activation, p is the activation, the Ws are the kernels, the Us are the recurrent kernels, h is the hidden variable (which is also the output), and the notation * is an element-wise multiplication.
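For reference, a standard formulation consistent with that description (my reconstruction of what the missing image presumably showed, not a quote from it), where g is the recurrent activation (sigmoid by default) and p the activation (tanh by default):

\begin{aligned}
i_t &= g(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= g(W_f x_t + U_f h_{t-1} + b_f) \\
\tilde{c}_t &= p(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= g(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * p(c_t)
\end{aligned}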




Which explains the 128 = 32*4: these are the parameters of the affine transformations happening inside each one of the 4 gates, concatenated (a small slicing sketch follows the list):



  • The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element.

  • The matrix of shape (32, 128) (named recurrent_kernel) handles the input from the last recurrent state h.

  • The vector of shape (128,) (named bias), as usual in any other NN setup.
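A small slicing sketch (my addition, assuming the gate ordering input/forget/cell/output used by the Keras LSTM implementation):

units = 32
kernel, recurrent_kernel, bias = lstm.get_weights()
W_i = kernel[:, 0 * units:1 * units]    # input gate kernel,     (3, 32)
W_f = kernel[:, 1 * units:2 * units]    # forget gate kernel,    (3, 32)
W_c = kernel[:, 2 * units:3 * units]    # cell candidate kernel, (3, 32)
W_o = kernel[:, 3 * units:4 * units]    # output gate kernel,    (3, 32)
print(W_i.shape, W_f.shape, W_c.shape, W_o.shape)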





Comments:

  • Thanks for this amazing explanation, I'll give it a go and will update. – Shlomi Schwartz, Nov 8 '18 at 9:21

  • If I use your suggestion of splitting the batch into single samples, does that mean I need to retrain my model on the same input and train on an (m, 1, 3) shape, like so? [[1,2,3],[1,2,3]...7 samples] & target each with the corresponding batch label, like so? [1,1,1,1,1,1,1,0,0,0,0,0,0,0] where each 7 labels are the same for each batch. – Shlomi Schwartz, Nov 8 '18 at 9:31

  • If you keep the hidden state, performing m one-element forward propagations is equivalent to performing one m-element forward propagation. But if you want the (speedup & regularization) advantages of batch_size>1 you do have to wait until you have all results and then average them (if I understood your question correctly you don't want to refrain from that). For that, either you have two setups, one batched for training and one with a for loop for production, or you can add a concatenator to the for-loop approach to artificially build up your batch. – fr_andres, Nov 8 '18 at 14:28

  • In other words, if you trained it with batches and it went well, you can still use the same model to perform one-by-one predictions like this: just be sure that the c hidden state remains untouched, and the h-state adapts to the current situation (depending on how you trained, you may have an "initial state" to start a fresh sequence, or an "end-of-sequence" character that you can push repeatedly to reset the h-state). – fr_andres, Nov 8 '18 at 14:31

  • The best way is to use a Model as in my example, then you have plenty of built-in support for serialization. Once you have numpy, json or whatever standard format, the backend is pretty much a matter of choice (HDF5 is precisely a compromise between filesystem and kvdb). – fr_andres, Nov 8 '18 at 16:46


















Note: This answer assumes that your model in the training phase is not stateful. You must understand what a stateful RNN layer is and make sure that the training data has the corresponding properties of statefulness. In short, it means there is a dependency between the sequences, i.e. one sequence is the follow-up to another sequence, which you want to consider in your model. If your model and training data are stateful, then I think the other answers, which involve setting stateful=True for the RNN layers from the beginning, are simpler.



Update: Whether the training model is stateful or not, you can always copy its weights to the inference model and enable statefulness there. So I think solutions based on setting stateful=True are shorter and better than mine. Their only drawback is that the batch size in these solutions must be fixed.




Note that the output of an LSTM layer over a single sequence is determined by its weight matrices, which are fixed, and by its internal states, which depend on the previously processed timesteps. Now, to get the output of the LSTM layer for a single sequence of length m, one obvious way is to feed the entire sequence to the LSTM layer in one go. However, since its internal states depend only on what has been processed so far, we can exploit this fact and feed that single sequence chunk by chunk, getting the state of the LSTM layer at the end of each chunk and passing it to the LSTM layer as the initial state for the next chunk. To make it clearer, suppose the sequence length is 7 (i.e. it has 7 timesteps of fixed-length feature vectors). As an example, it is possible to process this sequence like this:



  1. Feed the timesteps 1 and 2 to the LSTM layer; get the final state (call it C1).

  2. Feed the timesteps 3, 4 and 5 and state C1 as the initial state to the LSTM layer; get the final state (call it C2).

  3. Feed the timesteps 6 and 7 and state C2 as the initial state to the LSTM layer; get the final output.

That final output is equivalent to the output produced by the LSTM layer if we had fed it the entire 7 timesteps at once (see the small check below).
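A small numerical check of that equivalence (my own sketch, not part of the original answer):

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

n_feats, n_units = 3, 4
inp = Input(shape=(None, n_feats))
h_in = Input(shape=(n_units,))
c_in = Input(shape=(n_units,))
out, h, c = LSTM(n_units, return_state=True)(inp, initial_state=[h_in, c_in])
model = Model([inp, h_in, c_in], [out, h, c])

seq = np.random.rand(1, 7, n_feats)
zeros = np.zeros((1, n_units))

full_out, _, _ = model.predict([seq, zeros, zeros])          # all 7 timesteps at once

h_s, c_s = zeros, zeros                                      # same sequence in chunks of 2, 3 and 2
for chunk in (seq[:, 0:2], seq[:, 2:5], seq[:, 5:7]):
    chunk_out, h_s, c_s = model.predict([chunk, h_s, c_s])

print(np.allclose(full_out, chunk_out))                      # expected: True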



So to realize this in Keras, you can set the return_state argument of the LSTM layer to True so that you can get the intermediate state. Further, don't specify a fixed timestep length when defining the input layer. Instead, use None to be able to feed the model with sequences of arbitrary length, which enables us to process each sequence progressively (it's fine if your input data at training time are sequences of fixed length).



Since you need this chunk-processing capability at inference time, we need to define a new model which shares the LSTM layer used in the training model, takes the initial states as input, and also returns the resulting states as output. The following is a general sketch of how it could be done (note that the returned state of the LSTM layer is not used when training the model; we only need it at test time):



# define training model
train_input = Input(shape=(None, n_feats))   # note that the number of timesteps is None
lstm_layer = LSTM(n_units, return_state=True)
lstm_output, _, _ = lstm_layer(train_input)  # note that we ignore the returned states
classifier = Dense(1, activation='sigmoid')
train_output = classifier(lstm_output)

train_model = Model(train_input, train_output)

# compile and fit the model on training data ...

# ==================================================

# define inference model
inf_input = Input(shape=(None, n_feats))
state_h_input = Input(shape=(n_units,))
state_c_input = Input(shape=(n_units,))

# we use the layers of the previous model
lstm_output, state_h, state_c = lstm_layer(inf_input,
                                           initial_state=[state_h_input, state_c_input])
output = classifier(lstm_output)

inf_model = Model([inf_input, state_h_input, state_c_input],
                  [output, state_h, state_c])  # note that we return the states as output


Now you can feed inf_model with as many timesteps of a sequence as are available right now. However, note that initially you must feed the states with all-zero vectors (which is the default initial value of the states). For example, if the sequence length is 7, a sketch of what happens when a new data stream arrives is as follows:



state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))

# three new timesteps are available (shape: (1, 3, n_feats))
outputs = inf_model.predict([timesteps, state_h, state_c])

out = outputs[0][0, 0]  # you may ignore this output since the entire sequence has not been processed yet
state_h = outputs[1]
state_c = outputs[2]

# after some time another four new timesteps are available
outputs = inf_model.predict([timesteps, state_h, state_c])

# we have processed 7 timesteps, so the output is valid
out = outputs[0][0, 0]  # store it, pass it to another thread or do whatever you want to do with it

# reinitialize the states to make them ready for the next sequence chunk
state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))

# to be continued...


Of course, you need to do this in some kind of loop or implement a control-flow structure to process the data stream, but I think you get the general idea (a minimal sketch follows).
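A minimal sketch of such a loop (my own, not part of the original answer; stream() is a hypothetical generator of 3-feature samples):

import numpy as np

SEQ_LEN = 7
state_h = np.zeros((1, n_units))
state_c = np.zeros((1, n_units))
seen = 0

for sample in stream():                                            # hypothetical source of samples
    x = np.array(sample, dtype='float32').reshape(1, 1, n_feats)   # one new timestep
    out, state_h, state_c = inf_model.predict([x, state_h, state_c])
    prob = out[0, 0]                                               # intermediate probability so far
    seen += 1
    if seen == SEQ_LEN:                                            # a full 7-step sequence was processed
        state_h = np.zeros((1, n_units))                           # reset states for the next sequence
        state_c = np.zeros((1, n_units))
        seen = 0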



Finally, although your specific example is not a sequence-to-sequence model, I highly recommend reading the official Keras seq2seq tutorial, from which I think you can learn a lot of ideas.






Comments:

  • Thank you for your response, it was very educational. – Shlomi Schwartz, Nov 14 '18 at 12:10

  • Could you please have a look here: stackoverflow.com/questions/53376761/… ? :) – Shlomi Schwartz, Nov 19 '18 at 14:30


















As far as I know, because of the static graph in TensorFlow, there is no efficient way to feed inputs with a length different from the training input length.



Padding is the official way to work around that, but it is less efficient and memory-consuming. I suggest you look into PyTorch, with which it will be trivial to fix your problem.



There are a lot of great posts on building LSTMs with PyTorch, and you will understand the benefit of a dynamic graph once you see them.
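For illustration, a minimal PyTorch sketch of step-by-step prediction with an explicitly carried hidden state (my own example, assuming a single-layer LSTM with a sigmoid head analogous to the question's model):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)

h = torch.zeros(1, 1, 32)                        # (num_layers, batch, hidden)
c = torch.zeros(1, 1, 32)

for sample in ([1., 2., 3.], [1., 2., 3.]):      # stand-in for the incoming stream
    x = torch.tensor(sample).view(1, 1, 3)       # (batch=1, steps=1, features=3)
    out, (h, c) = lstm(x, (h, c))                # the state is carried explicitly between calls
    prob = torch.sigmoid(head(out[:, -1])).item()
    print(prob)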






share|improve this answer






















    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53190253%2fstateful-lstm-and-stream-predictions%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    4





    +500









    I think there might be an easier solution.



    If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True



    Warning: your model has layers that act on the length dimension !!



    The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.



    So, before applying my answer below, fix your model to not use the Flatten layer. Instead, it can just remove the return_sequences=True for the last LSTM layer.



    The following code fixed that and also prepares a few things to be used with the answer below:



    def createModel(forTraining):

    #model for training, stateful=False, any batch size
    if forTraining == True:
    batchSize = None
    stateful = False

    #model for predicting, stateful=True, fixed batch size
    else:
    batchSize = 1
    stateful = True

    model = Sequential()

    first_lstm = LSTM(32,
    batch_input_shape=(batchSize, num_samples, num_features),
    return_sequences=True, activation='tanh',
    stateful=stateful)

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))

    #this is the last LSTM layer, use return_sequences=False
    model.add(LSTM(16, return_sequences=False, stateful=stateful, activation='tanh'))

    model.add(Dropout(0.2))
    model.add(LeakyReLU())

    #don't add a Flatten!!!
    #model.add(Flatten())

    model.add(Dense(1, activation='sigmoid'))

    if forTraining == True:
    compileThisModel(model)


    With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.



    The usage of a stateful model as a solution for your question



    First, train this new model again, because it has no Flatten layer:



    trainingModel = createModel(forTraining=True)
    trainThisModel(trainingModel)


    Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but marking stateful=True in all its LSTM layers. And we should copy the weights from the trained model.



    Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.



    predictingModel = createModel(forTraining=False)
    predictingModel.set_weights(trainingModel.get_weights())


    And voilà. Just predict the outputs of the model with a single step:



    pseudo for loop as samples arrive to your model:
    prob = predictingModel.predict_on_batch(sample)

    #where sample.shape == (1, 1, 3)


    When you decide that you reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model thinking it should be mended at the end of the previous one.




    Saving and loading states



    Just get and set them, saving with h5py:



    def saveStates(model, saveName):

    f = h5py.File(saveName,'w')

    for l, lay in enumerate(model.layers):
    #if you have nested models,
    #consider making this recurrent testing for layers in layers
    if isinstance(lay,RNN):
    for s, stat in enumerate(lay.states):
    f.create_dataset('states_' + str(l) + '_' + str(s),
    data=K.eval(stat),
    dtype=K.dtype(stat))

    f.close()


    def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
    name, layer, state = stateKey.split('_')
    layer = int(layer)
    state = int(state)

    K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()


    Working test for saving/loading states



    import h5py, numpy as np
    from keras.layers import RNN, LSTM, Dense, Input
    from keras.models import Model
    import keras.backend as K




    def createModel():
    inp = Input(batch_shape=(1,None,3))
    out = LSTM(5,return_sequences=True, stateful=True)(inp)
    out = LSTM(2, stateful=True)(out)
    out = Dense(1)(out)
    model = Model(inp,out)
    return model


    def saveStates(model, saveName):

    f = h5py.File(saveName,'w')

    for l, lay in enumerate(model.layers):
    #if you have nested models, consider making this recurrent testing for layers in layers
    if isinstance(lay,RNN):
    for s, stat in enumerate(lay.states):
    f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))

    f.close()


    def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
    name, layer, state = stateKey.split('_')
    layer = int(layer)
    state = int(state)

    K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()

    def printStates(model):

    for l in model.layers:
    #if you have nested models, consider making this recurrent testing for layers in layers
    if isinstance(l,RNN):
    for s in l.states:
    print(K.eval(s))

    model1 = createModel()
    model2 = createModel()
    model1.predict_on_batch(np.ones((1,5,3))) #changes model 1 states

    print('model1')
    printStates(model1)
    print('model2')
    printStates(model2)

    saveStates(model1,'testStates5')
    loadStates(model2,'testStates5')

    print('model1')
    printStates(model1)
    print('model2')
    printStates(model2)


    Considerations on the aspects of the data



    In your first model (if it is stateful=False), it considers that each sequence in m is individual and not connected to the others. It also considers that each batch contains unique sequences.



    If this is not the case, you might want to train the stateful model instead (considering that each sequence is actually connected to the previous sequence). And then you would need m batches of 1 sequence. -> m x (1, 7 or None, 3).






    share|improve this answer




















    • 2





      I thought reset_states() would mess the trained parameters, but I can confirm it doesn't, so this in combination with predict_on_batch (which overcomes the problem of having to specify a batch_size for stateful=True) makes it a very compact and elegant solution that I will definitely use myself! The downside of less verbosity is that it doesn't address the requirement of saving the LSTM intermediate state in a persistent way

      – fr_andres
      Nov 14 '18 at 18:11











    • @fr_andres , this now saves and loads states.

      – Daniel Möller
      Nov 14 '18 at 19:56











    • @DanielMöller I have a problem with stateful solutions mentioned here. I would appreciate if you could tell me where I am wrong: as far as I know, stateful LSTM layer means that one sample in a batch is the successor of the corresponding sample in previous batch (from keras docs: If x1 and x2 are successive batches of samples, then x2[i] is the follow-up sequence to x1[i], for every i). That's why you correctly mentioned that batch sizes must be the same. However, the OP does not mention that the >>>>

      – today
      Nov 15 '18 at 8:09











    • @DanielMöller >>> statefulness assumption holds for the training data and in training phase. So setting stateful=True from the beginning (i.e. in the train model) may completely destroy the learning process. And if you don't set the stateful=True argument in the train model, you cannot modify it in its replica model for inference phase. What am I missing?

      – today
      Nov 15 '18 at 8:12












    • @DanielMöller I think you can just set the weights in your inference model as you did in your answer and enable statefulness in that only. I got my answer. Thanks.

      – today
      Nov 15 '18 at 16:47















    4





    +500









    I think there might be an easier solution.



    If your model does not have convolutional layers or any other layers that act upon the length/steps dimension, you can simply mark it as stateful=True



    Warning: your model has layers that act on the length dimension !!



    The Flatten layer transforms the length dimension into a feature dimension. This will completely prevent you from achieving your goal. If the Flatten layer is expecting 7 steps, you will always need 7 steps.



    So, before applying my answer below, fix your model to not use the Flatten layer. Instead, it can just remove the return_sequences=True for the last LSTM layer.



    The following code fixed that and also prepares a few things to be used with the answer below:



    def createModel(forTraining):

    #model for training, stateful=False, any batch size
    if forTraining == True:
    batchSize = None
    stateful = False

    #model for predicting, stateful=True, fixed batch size
    else:
    batchSize = 1
    stateful = True

    model = Sequential()

    first_lstm = LSTM(32,
    batch_input_shape=(batchSize, num_samples, num_features),
    return_sequences=True, activation='tanh',
    stateful=stateful)

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))

    #this is the last LSTM layer, use return_sequences=False
    model.add(LSTM(16, return_sequences=False, stateful=stateful, activation='tanh'))

    model.add(Dropout(0.2))
    model.add(LeakyReLU())

    #don't add a Flatten!!!
    #model.add(Flatten())

    model.add(Dense(1, activation='sigmoid'))

    if forTraining == True:
    compileThisModel(model)


    With this, you will be able to train with 7 steps and predict with one step. Otherwise it will not be possible.



    The usage of a stateful model as a solution for your question



    First, train this new model again, because it has no Flatten layer:



    trainingModel = createModel(forTraining=True)
    trainThisModel(trainingModel)


    Now, with this trained model, you can simply create a new model exactly the same way you created the trained model, but marking stateful=True in all its LSTM layers. And we should copy the weights from the trained model.



    Since these new layers will need a fixed batch size (Keras' rules), I assumed it would be 1 (one single stream is coming, not m streams) and added it to the model creation above.



    predictingModel = createModel(forTraining=False)
    predictingModel.set_weights(trainingModel.get_weights())


    And voilà. Just predict the outputs of the model with a single step:



# pseudo for loop, as samples arrive to your model:
prob = predictingModel.predict_on_batch(sample)

# where sample.shape == (1, 1, 3)


When you decide that you have reached the end of what you consider a continuous sequence, call predictingModel.reset_states() so you can safely start a new sequence without the model treating it as a continuation of the previous one.
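As a minimal sketch of that loop, assuming the predictingModel built above and a hypothetical get_next_sample() generator that yields one 3-feature vector at a time from your stream:

import numpy as np

# minimal streaming sketch; get_next_sample() is a hypothetical stand-in for your data source
for step, features in enumerate(get_next_sample()):
    sample = np.asarray(features, dtype='float32').reshape(1, 1, 3)   # (batch=1, steps=1, features=3)
    prob = predictingModel.predict_on_batch(sample)[0, 0]             # intermediate probability so far
    print("step", step, "probability", prob)

    if (step + 1) % 7 == 0:            # a logical 7-sample sequence just ended
        predictingModel.reset_states()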




    Saving and loading states



    Just get and set them, saving with h5py:



def saveStates(model, saveName):

    f = h5py.File(saveName, 'w')

    for l, lay in enumerate(model.layers):
        # if you have nested models,
        # consider making this recursive, testing for layers inside layers
        if isinstance(lay, RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s),
                                 data=K.eval(stat),
                                 dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()
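A hedged usage sketch of those two helpers, persisting the LSTM states across process restarts while you wait for the next sample (the file name 'stream_states.h5' and the placeholder sample are arbitrary choices for this sketch):

import os
import numpy as np

STATE_FILE = 'stream_states.h5'   # arbitrary file name chosen for this sketch

# restore the states left by a previous run, if any
if os.path.isfile(STATE_FILE):
    loadStates(predictingModel, STATE_FILE)

sample = np.zeros((1, 1, 3))      # placeholder for the real incoming sample
prob = predictingModel.predict_on_batch(sample)

# persist the updated states so the process can stop and resume later without losing context
saveStates(predictingModel, STATE_FILE)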


    Working test for saving/loading states



    import h5py, numpy as np
    from keras.layers import RNN, LSTM, Dense, Input
    from keras.models import Model
    import keras.backend as K




def createModel():
    inp = Input(batch_shape=(1, None, 3))
    out = LSTM(5, return_sequences=True, stateful=True)(inp)
    out = LSTM(2, stateful=True)(out)
    out = Dense(1)(out)
    model = Model(inp, out)
    return model


def saveStates(model, saveName):

    f = h5py.File(saveName, 'w')

    for l, lay in enumerate(model.layers):
        # if you have nested models, consider making this recursive, testing for layers inside layers
        if isinstance(lay, RNN):
            for s, stat in enumerate(lay.states):
                f.create_dataset('states_' + str(l) + '_' + str(s), data=K.eval(stat), dtype=K.dtype(stat))

    f.close()


def loadStates(model, saveName):

    f = h5py.File(saveName, 'r')
    allStates = list(f.keys())

    for stateKey in allStates:
        name, layer, state = stateKey.split('_')
        layer = int(layer)
        state = int(state)

        K.set_value(model.layers[layer].states[state], f.get(stateKey))

    f.close()

def printStates(model):

    for l in model.layers:
        # if you have nested models, consider making this recursive, testing for layers inside layers
        if isinstance(l, RNN):
            for s in l.states:
                print(K.eval(s))

    model1 = createModel()
    model2 = createModel()
    model1.predict_on_batch(np.ones((1,5,3))) #changes model 1 states

    print('model1')
    printStates(model1)
    print('model2')
    printStates(model2)

    saveStates(model1,'testStates5')
    loadStates(model2,'testStates5')

    print('model1')
    printStates(model1)
    print('model2')
    printStates(model2)


    Considerations on the aspects of the data



Your first model (if it is stateful=False) treats each of the m sequences as independent, not connected to the others, and assumes that every batch contains unique sequences.



If this is not the case, you might want to train the stateful model instead (treating each sequence as a continuation of the previous one). You would then need m batches of 1 sequence each, i.e. m x (1, 7 or None, 3).
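A minimal sketch of such a training loop, assuming a compiled stateful training model (here called statefulTrainingModel) built with batch size 1, and hypothetical names num_epochs and streams for the epoch count and the data layout:

# hypothetical layout: `streams` is a list of (chunks, labels) pairs, one per independent stream,
# where chunks is a list of arrays of shape (1, 7, 3) in chronological order and labels is the
# matching list of arrays of shape (1, 1)
for epoch in range(num_epochs):
    for chunks, labels in streams:
        statefulTrainingModel.reset_states()             # a new independent stream starts here
        for x, y in zip(chunks, labels):
            statefulTrainingModel.train_on_batch(x, y)   # state carries over between chunks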






edited Nov 17 '18 at 18:16
answered Nov 14 '18 at 15:30
Daniel Möller




















    • 2





I thought reset_states() would mess up the trained parameters, but I can confirm it doesn't, so this in combination with predict_on_batch (which overcomes the problem of having to specify a batch_size for stateful=True) makes it a very compact and elegant solution that I will definitely use myself! The downside of less verbosity is that it doesn't address the requirement of saving the LSTM intermediate state in a persistent way.

      – fr_andres
      Nov 14 '18 at 18:11











• @fr_andres, this now saves and loads states.

      – Daniel Möller
      Nov 14 '18 at 19:56











    • @DanielMöller I have a problem with stateful solutions mentioned here. I would appreciate if you could tell me where I am wrong: as far as I know, stateful LSTM layer means that one sample in a batch is the successor of the corresponding sample in previous batch (from keras docs: If x1 and x2 are successive batches of samples, then x2[i] is the follow-up sequence to x1[i], for every i). That's why you correctly mentioned that batch sizes must be the same. However, the OP does not mention that the >>>>

      – today
      Nov 15 '18 at 8:09











    • @DanielMöller >>> statefulness assumption holds for the training data and in training phase. So setting stateful=True from the beginning (i.e. in the train model) may completely destroy the learning process. And if you don't set the stateful=True argument in the train model, you cannot modify it in its replica model for inference phase. What am I missing?

      – today
      Nov 15 '18 at 8:12












    • @DanielMöller I think you can just set the weights in your inference model as you did in your answer and enable statefulness in that only. I got my answer. Thanks.

      – today
      Nov 15 '18 at 16:47














    4














If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so each batch has shape (m*7*3)). In any Keras RNN you can set the return_sequences flag to True to obtain the intermediate outputs, i.e., for every batch, instead of only the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.
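For illustration only, a minimal sketch of that idea (not the OP's architecture; a single LSTM plus a TimeDistributed Dense so every one of the 7 steps yields its own probability):

import numpy as np
from keras.layers import Input, LSTM, Dense, TimeDistributed
from keras.models import Model

x_in = Input(shape=(7, 3))
seq = LSTM(32, return_sequences=True)(x_in)              # one output vector per time step
per_step = TimeDistributed(Dense(1, activation='sigmoid'))(seq)
m = Model(x_in, per_step)

batch = np.random.rand(5, 7, 3).astype('float32')        # m = 5 sequences of length 7, 3 features
print(m.predict(batch).shape)                            # -> (5, 7, 1): output i only sees inputs 0..i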



But you would still get them all at once, at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the outputs while a batch is being processed, and this may be even more constrained if you are using any of the cuDNN-optimized variants. What you can do instead is regard your batch as 7 successive batches of shape (m*1*3) and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and handle the state manually, or simply set stateful to True and let the layer keep track of it.




    The following Python2+Keras example should exactly represent what you want. Specifically:



    • allowing to save the whole LSTM intermediate state in a persistent way

    • while waiting for the next sample

    • and predicting on a model trained on a specific batch size that may be arbitrary and unknown.

For that, it includes an example of stateful=True for the easiest training, and return_state=True for the most precise inference, so you get a flavor of both approaches. It also assumes that you are given a model that has been serialized and about which you don't know much. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me on the topic. Since you don't specify how the model was trained, I assumed a many-to-one training setup, but this could easily be adapted.



from __future__ import print_function
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.optimizers import Adam
import numpy as np

# globals
SEQ_LEN = 7
HID_DIMS = 32
OUTPUT_DIMS = 3  # 3-dimensional outputs in this toy example


##############################################################################
# define the model to be trained on a fixed batch size:
# assume many-to-one training setup (otherwise set return_sequences=True)
TRAIN_BATCH_SIZE = 20

x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
dense = Dense(OUTPUT_DIMS, activation='linear')
m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
m_train.summary()

# a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
targets = np.repeat([[123, 234, 345]], TRAIN_BATCH_SIZE, axis=0)  # dummy [[1,2,3],...] -> [123,234,345] mapping to be learned


# train the model on a fixed batch size and save it
print(">> INFERENCE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
m_train.save("trained_lstm.h5")
print(">> INFERENCE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))


##############################################################################
# Now, although we aren't training anymore, we want to do step-wise predictions
# that do alter the inner state of the model, and keep track of that.


m_trained = load_model("trained_lstm.h5")
print(">> INFERENCE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))

# now define an analogous model that allows a flexible batch size for inference:
x_in = Input(shape=[SEQ_LEN, 3])
h_in = Input(shape=[HID_DIMS])
c_in = Input(shape=[HID_DIMS])
pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h, cc, c])

# Let's confirm that this model is able to load the trained parameters:
# first, check that the performance from scratch is not good:
print(">> INFERENCE BEFORE SWAPPING MODEL:")
predictions, hs, zs, cs = m_inference.predict([batch123,
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                               np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                              batch_size=1)
print(predictions)


# import the weights from the trained model and check that it works:
print(">> INFERENCE AFTER SWAPPING MODEL:")
for layer in m_trained.layers:
    if "lstm" in layer.name:
        m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
        m_inference.get_layer("dense_infer").set_weights(layer.get_weights())

predictions, _, _, _ = m_inference.predict([batch123,
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
                                            np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
                                           batch_size=1)
print(predictions)


# finally perform granular predictions while keeping the recurrent activations. Starting the
# sequence with zeros is a common practice, but depending on how you trained, you might have an
# <END_OF_SEQUENCE> character that you might want to propagate instead:
h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # h, cc, c hold the network's throughput: h is the proper LSTM output, c is the accumulator and cc is (probably) the candidate
    current_input = batch123[i:i+1]  # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press any key to continue")



    Additional information:



Since we have two forms of state persistence:

    1. The saved/trained parameters of the model that are the same for each sequence

    2. The a, c states that evolve throughout the sequences and may be "restarted"



It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c states are handled explicitly, but the trained parameters aren't, and it may not be obvious how they are implemented internally or what they mean. They can be inspected as follows:



for w in lstm.weights:
    print(w.name, w.shape)


In our case (32 hidden units), this returns the following:



    lstm_1/kernel:0 (3, 128)
    lstm_1/recurrent_kernel:0 (32, 128)
    lstm_1/bias:0 (128,)


We observe a dimensionality of 128. Why is that? This link describes the Keras LSTM implementation as follows:



[image: the gate equations of the Keras LSTM implementation]




    The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.




This explains the 128 = 32*4: it is the concatenation of the parameters of the affine transformation happening inside each of the 4 gates (a quick numeric check follows the list below):



    • The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element

    • The matrix of shape (32, 128) (named recurrent_kernel) handles the input for the last recurrent state h.

    • The vector of shape (128,) (named bias), as usual in any other NN setup.
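As a quick sanity check of those shapes, the layer's parameter count can be recomputed by hand; a small sketch, assuming the 3 input features and 32 units used above:

units, features = 32, 3
kernel_params = features * 4 * units               # shape (3, 128)
recurrent_params = units * 4 * units               # shape (32, 128)
bias_params = 4 * units                            # shape (128,)
print(kernel_params + recurrent_params + bias_params)  # 4608 trainable parameters for this LSTM layer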






























• Thanks for this amazing explanation, I'll give it a go and will update.

      – Shlomi Schwartz
      Nov 8 '18 at 9:21











• If I use your suggestion of splitting the batch into single samples, does that mean I need to retrain my model on the same input and train on an (m*1*3) shape, like so? [[1,2,3],[1,2,3]...7 samples] & target each with the corresponding batch label, like so? [1,1,1,1,1,1,1,0,0,0,0,0,0,0] where every 7 labels are the same for each batch.

      – Shlomi Schwartz
      Nov 8 '18 at 9:31











• If you keep the hidden state, performing m one-element forward propagations is equivalent to performing one m-element forward propagation. But if you want the (speedup & regularization) advantages of batch_size>1 you do have to wait until you have all results and then average them (if I understood your question correctly you don't want to refrain from that). For that, either you have two setups, one batched for training and one with a for loop for production, or you can add a concatenator to the for-loop approach to artificially build up your batch

      – fr_andres
      Nov 8 '18 at 14:28






    • 1





      In other words, if you trained it with batches and went well, you can still use the same model to perform one-by-one predictions like this: just be sure that the c hidden state remains untouched, and the h-state adapts to the current situation (depending on how you trained, you may have an "initial state" to start a fresh sequence, or an "end-of-sequence" character that you can push repeatedly to reset the h-state)

      – fr_andres
      Nov 8 '18 at 14:31







    • 1





The best way is to use a Model as in my example, then you have plenty of built-in support for serialization. Once you have numpy, json or whatever standard format, the backend is pretty much a matter of choice (HDF5 is precisely a compromise between filesystem and kvdb)

      – fr_andres
      Nov 8 '18 at 16:46















    4














    If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m*7*3)).
    In any Keras RNN you can set the
    return_sequences flag to True to become the intermediate states, i.e., for every batch, instead of the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.



    But you would be getting all at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the throughput whilst the batch is being processed. This may be even more constrained if you are using any of the CUDNN-optimized variants. What you can do is basically to regard your batch as 7 succesive batches of shape (m*1*3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set statefulto True and let the object keep track of it.




    The following Python2+Keras example should exactly represent what you want. Specifically:



    • allowing to save the whole LSTM intermediate state in a persistent way

    • while waiting for the next sample

    • and predicting on a model trained on a specific batch size that may be arbitrary and unknown.

    For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and from which you don't know much about. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me in the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.



    from __future__ import print_function
    from keras.layers import Input, LSTM, Dense
    from keras.models import Model, load_model
    from keras.optimizers import Adam
    import numpy as np

    # globals
    SEQ_LEN = 7
    HID_DIMS = 32
    OUTPUT_DIMS = 3 # outputs are assumed to be scalars


    ##############################################################################
    # define the model to be trained on a fixed batch size:
    # assume many-to-one training setup (otherwise set return_sequences=True)
    TRAIN_BATCH_SIZE = 20

    x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
    lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
    dense = Dense(OUTPUT_DIMS, activation='linear')
    m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
    m_train.summary()

    # a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
    batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
    targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned


    # train the model on a fixed batch size and save it
    print(">> INFERECE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
    m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
    m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
    m_train.save("trained_lstm.h5")
    print(">> INFERECE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))


    ##############################################################################
    # Now, although we aren't training anymore, we want to do step-wise predictions
    # that do alter the inner state of the model, and keep track of that.


    m_trained = load_model("trained_lstm.h5")
    print(">> INFERECE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))

    # now define an analogous model that allows a flexible batch size for inference:
    x_in = Input(shape=[SEQ_LEN, 3])
    h_in = Input(shape=[HID_DIMS])
    c_in = Input(shape=[HID_DIMS])
    pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
    h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
    prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
    m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])

    # Let's confirm that this model is able to load the trained parameters:
    # first, check that the performance from scratch is not good:
    print(">> INFERENCE BEFORE SWAPPING MODEL:")
    predictions, hs, zs, cs = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # import state from the trained model state and check that it works:
    print(">> INFERENCE AFTER SWAPPING MODEL:")
    for layer in m_trained.layers:
    if "lstm" in layer.name:
    m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
    m_inference.get_layer("dense_infer").set_weights(layer.get_weights())

    predictions, _, _, _ = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # finally perform granular predictions while keeping the recurrent activations. Starting the sequence with zeros is a common practice, but depending on how you trained, you might have an <END_OF_SEQUENCE> character that you might want to propagate instead:
    h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
    for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # h,z,c hold the network's throughput: h is the proper LSTM output, c is the accumulator and cc is (probably) the candidate
    current_input = batch123[i:i+1] # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press any key to continue")



    Additional information:



    Since we have two forms of state persistency:

    1. The saved/trained parameters of the model that are the same for each sequence

    2. The a, c states that evolve throughout the sequences and may be "restarted"



    It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c weights are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what do they mean. They can be inspected as follows:



    for w in lstm.weights:
    print(w.name, w.shape)


    In our case (32 hidden states) returns the following:



    lstm_1/kernel:0 (3, 128)
    lstm_1/recurrent_kernel:0 (32, 128)
    lstm_1/bias:0 (128,)


    We observe a dimensionality of 128. Why is that? this link describes the Keras LSTM implementation as follows:



    enter image description here




    The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.




    Which explains the 128=32*4 being the parameters for the affine transformation happening inside each one of the 4 gates, concatenated:



    • The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element

    • The matrix of shape (32, 128) (named recurrent_kernel) handles the input for the last recurrent state h.

    • The vector of shape (128,) (named bias), as usual in any other NN setup.





    share|improve this answer

























    • Thanks, for this amazing explanation, I'll give it a go and will update.

      – Shlomi Schwartz
      Nov 8 '18 at 9:21











    • If I use your suggestion of spliting the batch to single samples, does that mean I need to retrain my model on the same input and train on (m*1*3) shape, like so? [[1,2,3],[1,2,3]...7 samples] & target each with the coresponding batch label like so? [1,1,1,1,1,1,1,0,0,0,0,0,0,0] where each 7 lables are the same for each batch.

      – Shlomi Schwartz
      Nov 8 '18 at 9:31











    • If you keep the hidden state, performing m one-element forward propagations is equivalent to performing one m-element forward propagation. But if you want the (speedup&regularization) advantages of batch_size>1 you do have to wait until you have all results and then average them (if I understood your question correctly you don't want to refrain from that). For that, either you have two setups, one batched for training and one with for loop for production, or you can add a concatenator to the forapproach to artificially build-up your batch

      – fr_andres
      Nov 8 '18 at 14:28






    • 1





      In other words, if you trained it with batches and went well, you can still use the same model to perform one-by-one predictions like this: just be sure that the c hidden state remains untouched, and the h-state adapts to the current situation (depending on how you trained, you may have an "initial state" to start a fresh sequence, or an "end-of-sequence" character that you can push repeatedly to reset the h-state)

      – fr_andres
      Nov 8 '18 at 14:31







    • 1





      The best way is to use a Model as in my example, then you have plenty of built-in support for serialization. Once you have numpy, json or whatever standard format the backend is pretty much a matter of choice (HDF5 is precisely a compromise between filesystem and kvdb)

      – fr_andres
      Nov 8 '18 at 16:46













    4












    4








    4







    If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m*7*3)).
    In any Keras RNN you can set the
    return_sequences flag to True to become the intermediate states, i.e., for every batch, instead of the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.



    But you would be getting all at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the throughput whilst the batch is being processed. This may be even more constrained if you are using any of the CUDNN-optimized variants. What you can do is basically to regard your batch as 7 succesive batches of shape (m*1*3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set statefulto True and let the object keep track of it.




    The following Python2+Keras example should exactly represent what you want. Specifically:



    • allowing to save the whole LSTM intermediate state in a persistent way

    • while waiting for the next sample

    • and predicting on a model trained on a specific batch size that may be arbitrary and unknown.

    For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and from which you don't know much about. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me in the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.



    from __future__ import print_function
    from keras.layers import Input, LSTM, Dense
    from keras.models import Model, load_model
    from keras.optimizers import Adam
    import numpy as np

    # globals
    SEQ_LEN = 7
    HID_DIMS = 32
    OUTPUT_DIMS = 3 # outputs are assumed to be scalars


    ##############################################################################
    # define the model to be trained on a fixed batch size:
    # assume many-to-one training setup (otherwise set return_sequences=True)
    TRAIN_BATCH_SIZE = 20

    x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
    lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
    dense = Dense(OUTPUT_DIMS, activation='linear')
    m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
    m_train.summary()

    # a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
    batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
    targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned


    # train the model on a fixed batch size and save it
    print(">> INFERECE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
    m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
    m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
    m_train.save("trained_lstm.h5")
    print(">> INFERECE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))


    ##############################################################################
    # Now, although we aren't training anymore, we want to do step-wise predictions
    # that do alter the inner state of the model, and keep track of that.


    m_trained = load_model("trained_lstm.h5")
    print(">> INFERECE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))

    # now define an analogous model that allows a flexible batch size for inference:
    x_in = Input(shape=[SEQ_LEN, 3])
    h_in = Input(shape=[HID_DIMS])
    c_in = Input(shape=[HID_DIMS])
    pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
    h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
    prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
    m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])

    # Let's confirm that this model is able to load the trained parameters:
    # first, check that the performance from scratch is not good:
    print(">> INFERENCE BEFORE SWAPPING MODEL:")
    predictions, hs, zs, cs = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # import state from the trained model state and check that it works:
    print(">> INFERENCE AFTER SWAPPING MODEL:")
    for layer in m_trained.layers:
    if "lstm" in layer.name:
    m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
    elif "dense" in layer.name:
    m_inference.get_layer("dense_infer").set_weights(layer.get_weights())

    predictions, _, _, _ = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # finally perform granular predictions while keeping the recurrent activations. Starting the sequence with zeros is a common practice, but depending on how you trained, you might have an <END_OF_SEQUENCE> character that you might want to propagate instead:
    h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
    for i in range(len(batch123)):
    # about output shape: https://keras.io/layers/recurrent/#rnn
    # h,z,c hold the network's throughput: h is the proper LSTM output, c is the accumulator and cc is (probably) the candidate
    current_input = batch123[i:i+1] # the length of this feed is arbitrary, doesn't have to be 1
    pred, h, cc, c = m_inference.predict([current_input, h, c])
    print("input:", current_input)
    print("output:", pred)
    print(h.shape, cc.shape, c.shape)
    raw_input("do something with your prediction and hidden state and press any key to continue")



    Additional information:



    Since we have two forms of state persistency:

    1. The saved/trained parameters of the model that are the same for each sequence

    2. The a, c states that evolve throughout the sequences and may be "restarted"



    It is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c weights are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what do they mean. They can be inspected as follows:



    for w in lstm.weights:
    print(w.name, w.shape)


    In our case (32 hidden states) returns the following:



    lstm_1/kernel:0 (3, 128)
    lstm_1/recurrent_kernel:0 (32, 128)
    lstm_1/bias:0 (128,)


    We observe a dimensionality of 128. Why is that? this link describes the Keras LSTM implementation as follows:



    enter image description here




    The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.




    Which explains the 128=32*4 being the parameters for the affine transformation happening inside each one of the 4 gates, concatenated:



    • The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element

    • The matrix of shape (32, 128) (named recurrent_kernel) handles the input for the last recurrent state h.

    • The vector of shape (128,) (named bias), as usual in any other NN setup.





    share|improve this answer















    If I understood correctly, you have batches of m sequences, each of length 7, whose elements are 3-dimensional vectors (so batch has shape (m*7*3)).
    In any Keras RNN you can set the
    return_sequences flag to True to become the intermediate states, i.e., for every batch, instead of the definitive prediction, you will get the corresponding 7 outputs, where output i represents the prediction at stage i given all inputs from 0 to i.



    But you would be getting all at once at the end. As far as I know, Keras doesn't provide a direct interface for retrieving the throughput whilst the batch is being processed. This may be even more constrained if you are using any of the CUDNN-optimized variants. What you can do is basically to regard your batch as 7 succesive batches of shape (m*1*3), and feed them progressively to your LSTM, recording the hidden state and prediction at each step. For that, you can either set return_state to True and do it manually, or you can simply set statefulto True and let the object keep track of it.




    The following Python2+Keras example should exactly represent what you want. Specifically:



    • allowing to save the whole LSTM intermediate state in a persistent way

    • while waiting for the next sample

    • and predicting on a model trained on a specific batch size that may be arbitrary and unknown.

    For that, it includes an example of stateful=True for easiest training, and return_state=True for most precise inference, so you get a flavor of both approaches. It also assumes that you get a model that has been serialized and from which you don't know much about. The structure is closely related to the one in Andrew Ng's course, who is definitely more authoritative than me in the topic. Since you don't specify how the model has been trained, I assumed a many-to-one training setup, but this could be easily adapted.



    from __future__ import print_function
    from keras.layers import Input, LSTM, Dense
    from keras.models import Model, load_model
    from keras.optimizers import Adam
    import numpy as np

    # globals
    SEQ_LEN = 7
    HID_DIMS = 32
    OUTPUT_DIMS = 3 # outputs are assumed to be scalars


    ##############################################################################
    # define the model to be trained on a fixed batch size:
    # assume many-to-one training setup (otherwise set return_sequences=True)
    TRAIN_BATCH_SIZE = 20

    x_in = Input(batch_shape=[TRAIN_BATCH_SIZE, SEQ_LEN, 3])
    lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, stateful=True)
    dense = Dense(OUTPUT_DIMS, activation='linear')
    m_train = Model(inputs=x_in, outputs=dense(lstm(x_in)))
    m_train.summary()

    # a dummy batch of training data of shape (TRAIN_BATCH_SIZE, SEQ_LEN, 3), with targets of shape (TRAIN_BATCH_SIZE, 3):
    batch123 = np.repeat([[1, 2, 3]], SEQ_LEN, axis=0).reshape(1, SEQ_LEN, 3).repeat(TRAIN_BATCH_SIZE, axis=0)
    targets = np.repeat([[123,234,345]], TRAIN_BATCH_SIZE, axis=0) # dummy [[1,2,3],,,]-> [123,234,345] mapping to be learned


    # train the model on a fixed batch size and save it
    print(">> INFERECE BEFORE TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))
    m_train.compile(optimizer=Adam(lr=0.5), loss='mean_squared_error', metrics=['mae'])
    m_train.fit(batch123, targets, epochs=100, batch_size=TRAIN_BATCH_SIZE)
    m_train.save("trained_lstm.h5")
    print(">> INFERECE AFTER TRAINING MODEL:", m_train.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))


    ##############################################################################
    # Now, although we aren't training anymore, we want to do step-wise predictions
    # that do alter the inner state of the model, and keep track of that.


    m_trained = load_model("trained_lstm.h5")
    print(">> INFERECE AFTER RELOADING TRAINED MODEL:", m_trained.predict(batch123, batch_size=TRAIN_BATCH_SIZE, verbose=0))

    # now define an analogous model that allows a flexible batch size for inference:
    x_in = Input(shape=[SEQ_LEN, 3])
    h_in = Input(shape=[HID_DIMS])
    c_in = Input(shape=[HID_DIMS])
    pred_lstm = LSTM(HID_DIMS, activation="tanh", return_sequences=False, return_state=True, name="lstm_infer")
    h, cc, c = pred_lstm(x_in, initial_state=[h_in, c_in])
    prediction = Dense(OUTPUT_DIMS, activation='linear', name="dense_infer")(h)
    m_inference = Model(inputs=[x_in, h_in, c_in], outputs=[prediction, h,cc,c])

    # Let's confirm that this model is able to load the trained parameters:
    # first, check that the performance from scratch is not good:
    print(">> INFERENCE BEFORE SWAPPING MODEL:")
    predictions, hs, zs, cs = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # import the weights from the trained model and check that it works:
    print(">> INFERENCE AFTER SWAPPING MODEL:")
    for layer in m_trained.layers:
        if "lstm" in layer.name:
            m_inference.get_layer("lstm_infer").set_weights(layer.get_weights())
        elif "dense" in layer.name:
            m_inference.get_layer("dense_infer").set_weights(layer.get_weights())

    predictions, _, _, _ = m_inference.predict([batch123,
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)),
    np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))],
    batch_size=1)
    print(predictions)


    # finally perform granular predictions while keeping the recurrent activations.
    # Starting the sequence with zeros is a common practice, but depending on how you trained,
    # you might have an <END_OF_SEQUENCE> character that you might want to propagate instead:
    h, c = np.zeros((TRAIN_BATCH_SIZE, HID_DIMS)), np.zeros((TRAIN_BATCH_SIZE, HID_DIMS))
    for i in range(SEQ_LEN):
        # about output shape: https://keras.io/layers/recurrent/#rnn
        # h is the proper LSTM output, cc is the last hidden state (identical to h here,
        # since return_sequences=False) and c is the cell state ("accumulator")
        current_input = batch123[:, i:i+1, :]  # one timestep per sequence; the chunk length is arbitrary, it doesn't have to be 1
        pred, h, cc, c = m_inference.predict([current_input, h, c])
        print("input:", current_input)
        print("output:", pred)
        print(h.shape, cc.shape, c.shape)
        raw_input("do something with your prediction and hidden state and press any key to continue")



    Additional information:



    We have two forms of state persistency:

    1. The saved/trained parameters of the model that are the same for each sequence

    2. The a, c states that evolve throughout the sequences and may be "restarted"



    Given this, it is interesting to take a look at the guts of the LSTM object. In the Python example that I provide, the a and c states are explicitly handled, but the trained parameters aren't, and it may not be obvious how they are internally implemented or what they mean. They can be inspected as follows:



    for w in lstm.weights:
        print(w.name, w.shape)


    In our case (32 hidden units), this returns the following:



    lstm_1/kernel:0 (3, 128)
    lstm_1/recurrent_kernel:0 (32, 128)
    lstm_1/bias:0 (128,)


    We observe a dimensionality of 128. Why is that? This link describes the Keras LSTM implementation as follows:



    The linked image shows the (standard) LSTM gate equations, roughly:

    i = g(W_i·x + U_i·h + b_i)
    f = g(W_f·x + U_f·h + b_f)
    o = g(W_o·x + U_o·h + b_o)
    c~ = p(W_c·x + U_c·h + b_c)
    c_new = f * c + i * c~
    h_new = o * p(c_new)




    The g is the recurrent activation, p is the activation, Ws are the kernels, Us are the recurrent kernels, h is the hidden variable which is the output too and the notation * is an element-wise multiplication.




    This explains the 128 = 32*4: they are the parameters of the affine transformations happening inside each of the 4 gates, concatenated:



    • The matrix of shape (3, 128) (named kernel) handles the input for a given sequence element

    • The matrix of shape (32, 128) (named recurrent_kernel) handles the contribution of the last recurrent state h.

    • The vector of shape (128,) (named bias), as usual in any other NN setup.
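    As an illustration (my addition, not from the original answer), those concatenated arrays can be sliced back into per-gate blocks; to the best of my knowledge Keras stores them in the order input, forget, cell (candidate), output:

    # Hypothetical sketch: splitting the concatenated LSTM weights into per-gate blocks
    # (the gate order i, f, c, o is an assumption about the Keras layout).
    import numpy as np

    kernel, recurrent_kernel, bias = lstm.get_weights()         # (3, 128), (32, 128), (128,)

    W_i, W_f, W_c, W_o = np.split(kernel, 4, axis=1)            # each (3, 32)
    U_i, U_f, U_c, U_o = np.split(recurrent_kernel, 4, axis=1)  # each (32, 32)
    b_i, b_f, b_c, b_o = np.split(bias, 4)                      # each (32,)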






    answered Nov 7 '18 at 16:11 by fr_andres (edited Nov 15 '18 at 0:22)
    • Thanks for this amazing explanation, I'll give it a go and will update.

      – Shlomi Schwartz
      Nov 8 '18 at 9:21











    • If I use your suggestion of splitting the batch into single samples, does that mean I need to retrain my model on the same input and train on (m*1*3) shape, like so? [[1,2,3],[1,2,3]...7 samples] & target each with the corresponding batch label, like so? [1,1,1,1,1,1,1,0,0,0,0,0,0,0] where every 7 labels are the same for each batch.

      – Shlomi Schwartz
      Nov 8 '18 at 9:31











    • If you keep the hidden state, performing m one-element forward propagations is equivalent to performing one m-element forward propagation. But if you want the (speedup & regularization) advantages of batch_size > 1 you do have to wait until you have all results and then average them (if I understood your question correctly, you don't want to refrain from that). For that, either you have two setups, one batched for training and one with a for loop for production, or you can add a concatenator to the for-loop approach to artificially build up your batch

      – fr_andres
      Nov 8 '18 at 14:28






    • In other words, if you trained it with batches and it went well, you can still use the same model to perform one-by-one predictions like this: just be sure that the c hidden state remains untouched, and the h-state adapts to the current situation (depending on how you trained, you may have an "initial state" to start a fresh sequence, or an "end-of-sequence" character that you can push repeatedly to reset the h-state)

      – fr_andres
      Nov 8 '18 at 14:31







    • The best way is to use a Model as in my example, then you have plenty of built-in support for serialization. Once you have numpy, json or whatever standard format, the backend is pretty much a matter of choice (HDF5 is precisely a compromise between filesystem and kvdb)

      – fr_andres
      Nov 8 '18 at 16:46

















    Note: This answer assumes that your model in the training phase is not stateful. You must understand what a stateful RNN layer is and make sure that the training data has the corresponding properties of statefulness. In short it means there is a dependency between the sequences, i.e. one sequence is the follow-up to another sequence, which you want to consider in your model. If your model and training data are stateful, then I think the other answers, which involve setting stateful=True for the RNN layers from the beginning, are simpler.



    Update: Whether or not the training model is stateful, you can always copy its weights to the inference model and enable statefulness. So I think solutions based on setting stateful=True are shorter and better than mine. Their only drawback is that the batch size in these solutions must be fixed.
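    A minimal sketch of that route (my own illustration, not part of the original answer), assuming the train_model, n_feats and n_units defined further down in this answer and an inference batch size of 1:

    # Hypothetical sketch: build a stateful twin of the (non-stateful) trained model
    # and copy the weights into it, so predict() can be called chunk by chunk.
    from keras.layers import Input, LSTM, Dense
    from keras.models import Model

    stateful_in = Input(batch_shape=(1, None, n_feats))
    stateful_h = LSTM(n_units, stateful=True)(stateful_in)
    stateful_out = Dense(1, activation='sigmoid')(stateful_h)
    stateful_model = Model(stateful_in, stateful_out)

    # same layer order as train_model below, so the flat weight lists line up
    stateful_model.set_weights(train_model.get_weights())

    # p = stateful_model.predict(chunk)   # chunk shape: (1, n_timesteps, n_feats); state is kept between calls
    # stateful_model.reset_states()       # call this once a full 7-step sequence has been consumed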




    Note that the output of an LSTM layer over a single sequence is determined by its weight matrices, which are fixed, and its internal states, which depend on the previously processed timesteps. Now, to get the output of the LSTM layer for a single sequence of length m, one obvious way is to feed the entire sequence to the LSTM layer in one go. However, as I stated earlier, since its internal states depend on the previous timesteps, we can exploit this fact and feed that single sequence chunk by chunk, getting the state of the LSTM layer at the end of processing one chunk and passing it to the LSTM layer for processing the next chunk. To make it clearer, suppose the sequence length is 7 (i.e. it has 7 timesteps of fixed-length feature vectors). As an example, it is possible to process this sequence like this:



    1. Feed the timesteps 1 and 2 to the LSTM layer; get the final state (call it C1).

    2. Feed the timesteps 3, 4 and 5 and state C1 as the initial state to the LSTM layer; get the final state (call it C2).

    3. Feed the timesteps 6 and 7 and state C2 as the initial state to the LSTM layer; get the final output.

    That final output is equivalent to the output produced by the LSTM layer if we had fed it all 7 timesteps at once.
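    As a small self-contained check of that equivalence (my addition, separate from the answer's own code below), using a randomly initialized LSTM and the 2 / 3 / 2 split above:

    # Hypothetical sketch: chunked processing with carried states vs. one full pass.
    import numpy as np
    from keras.layers import Input, LSTM
    from keras.models import Model

    n_feats, n_units = 3, 8
    x_in = Input(shape=(None, n_feats))
    lstm = LSTM(n_units, return_state=True)
    out, h, c = lstm(x_in)
    full = Model(x_in, out)                               # full-sequence pass

    h_in = Input(shape=(n_units,))
    c_in = Input(shape=(n_units,))
    out2, h2, c2 = lstm(x_in, initial_state=[h_in, c_in])
    chunked = Model([x_in, h_in, c_in], [out2, h2, c2])   # shares the same LSTM layer

    seq = np.random.rand(1, 7, n_feats)
    o_full = full.predict(seq)

    h_s, c_s = np.zeros((1, n_units)), np.zeros((1, n_units))
    for a, b in [(0, 2), (2, 5), (5, 7)]:                 # timesteps 1-2, 3-5, 6-7
        o_chunk, h_s, c_s = chunked.predict([seq[:, a:b], h_s, c_s])

    print(np.allclose(o_full, o_chunk, atol=1e-5))        # expected: True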



    So to realize this in Keras, you can set the return_state argument of the LSTM layer to True so that you can get the intermediate state. Further, don't specify a fixed timestep length when defining the input layer. Instead use None, so that the model can be fed with sequences of arbitrary length, which enables us to process each sequence progressively (it's fine if your input data at training time are sequences of fixed length).



    Since you need this chunk-processing capability at inference time, we need to define a new model which shares the LSTM layer used in the training model, and which can take the initial states as input and also gives the resulting states as output. The following is a general sketch of how it could be done (note that the returned state of the LSTM layer is not used when training the model; we only need it at test time):



    # define training model
    train_input = Input(shape=(None, n_feats)) # note that the number of timesteps is None
    lstm_layer = LSTM(n_units, return_state=True)
    lstm_output, _, _ = lstm_layer(train_input) # note that we ignore the returned states
    classifier = Dense(1, activation='sigmoid')
    train_output = classifier(lstm_output)

    train_model = Model(train_input, train_output)

    # compile and fit the model on training data ...

    # ==================================================

    # define inference model
    inf_input = Input(shape=(None, n_feats))
    state_h_input = Input(shape=(n_units,))
    state_c_input = Input(shape=(n_units,))

    # we use the layers of previous model
    lstm_output, state_h, state_c = lstm_layer(inf_input,
                                               initial_state=[state_h_input, state_c_input])
    output = classifier(lstm_output)

    inf_model = Model([inf_input, state_h_input, state_c_input],
                      [output, state_h, state_c])  # note that we return the states as output


    Now you can feed inf_model with as many timesteps of a sequence as are available right now. However, note that initially you must feed the states with all-zero vectors (which is the default initial value of the states). For example, if the sequence length is 7, a sketch of what happens when a new data stream is available is as follows:



    state_h = np.zeros((1, n_units))
    state_c = np.zeros((1, n_units))

    # three new timesteps are available
    outputs = inf_model.predict([timesteps, state_h, state_c])

    out = outputs[0][0, 0]  # you may ignore this output since the entire sequence has not been processed yet
    state_h = outputs[1]
    state_c = outputs[2]

    # after some time another four new timesteps are available
    outputs = inf_model.predict([timesteps, state_h, state_c])

    # we have processed 7 timesteps, so the output is valid
    out = outputs[0][0, 0]  # store it, pass it to another thread or do whatever you want to do with it

    # reinitialize the states to make them ready for the next sequence chunk
    state_h = np.zeros((1, n_units))
    state_c = np.zeros((1, n_units))

    # to be continued...


    Of course you need to do this in some kind of loop, or implement a control-flow structure, to process the data stream, but I think you get the general idea.
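    A rough sketch of such a loop (my own addition), reusing inf_model, n_feats and n_units from above; get_next_chunk() is a hypothetical stand-in for however new timesteps arrive in production:

    # Hypothetical sketch of the streaming control flow around inf_model.
    import numpy as np

    SEQ_LEN = 7
    state_h = np.zeros((1, n_units))
    state_c = np.zeros((1, n_units))
    seen = 0

    while True:
        chunk = get_next_chunk()                        # shape (1, k, n_feats), k >= 1
        out, state_h, state_c = inf_model.predict([chunk, state_h, state_c])
        seen += chunk.shape[1]
        print("intermediate probability:", out[0, 0])   # available after every chunk
        if seen >= SEQ_LEN:                             # a full 7-step sequence has been consumed
            # store/forward the final prediction, then reset the states for the next sequence
            state_h = np.zeros((1, n_units))
            state_c = np.zeros((1, n_units))
            seen = 0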



    Finally, although your specific example is not a sequence-to-sequence model, I highly recommend reading the official Keras seq2seq tutorial, from which I think one can learn a lot of ideas.






    answered Nov 13 '18 at 16:04 by today (edited Nov 15 '18 at 16:56)
    • Thank you for your response, it was very educational

      – Shlomi Schwartz
      Nov 14 '18 at 12:10











    • Could you please have a look here stackoverflow.com/questions/53376761/… ? :)

      – Shlomi Schwartz
      Nov 19 '18 at 14:30















    As far as I know, because of the static graph in TensorFlow, there is no efficient way to feed inputs whose length differs from the training input length.



    Padding is the official way to work around that, but it is less efficient and memory-consuming. I suggest you look into PyTorch, in which fixing your problem would be trivial.
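    For reference, a minimal sketch of that padding workaround (my own illustration, not the answer's code), assuming a model trained on (7, 3)-shaped inputs:

    # Hypothetical sketch: left-pad the partial stream with zeros up to the fixed length of 7.
    import numpy as np

    SEQ_LEN, N_FEATS = 7, 3
    received = [[1, 2, 3], [1, 2, 3]]          # only 2 of the 7 samples have arrived so far

    padded = np.zeros((1, SEQ_LEN, N_FEATS))
    padded[0, -len(received):] = received      # keep the real samples at the end

    # prob = model.predict(padded)             # model expects inputs of shape (None, 7, 3)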



    There are a lot of great posts about building LSTMs with PyTorch, and you will understand the benefit of dynamic graphs once you see them.
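    As an illustration (my addition, not the answer's code), a minimal PyTorch sketch of the streaming case, keeping the (h, c) state explicit with an LSTMCell:

    # Hypothetical sketch: stream one timestep at a time through an LSTMCell.
    import torch
    import torch.nn as nn

    n_feats, n_units = 3, 32
    cell = nn.LSTMCell(n_feats, n_units)
    classifier = nn.Linear(n_units, 1)

    h = torch.zeros(1, n_units)
    c = torch.zeros(1, n_units)

    with torch.no_grad():
        for sample in [[1.0, 2.0, 3.0]] * 7:            # stand-in for the data stream
            x = torch.tensor([sample])                  # shape (1, n_feats)
            h, c = cell(x, (h, c))                      # state is carried across calls
            prob = torch.sigmoid(classifier(h))         # intermediate probability
            print(prob.item())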






    answered Nov 16 '18 at 20:00 by Shawn