I'm getting a MemoryError while processing my dataset in Python. What could be the reason?





















I'm working on a deep learning preprocessing script for my dataset, which consists of 1,12,120 images. Here is what my code does:



import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import imageio
from os import listdir
import skimage.transform
import pickle
import sys, os
from sklearn.preprocessing import MultiLabelBinarizer

def get_labels(pic_id):
    labels = meta_data.loc[meta_data["Image Index"] == pic_id, "Finding Labels"]
    return labels.tolist()[0].split("|")

# Loading data
meta_data = pd.read_csv(data_entry_path)
bbox_list = pd.read_csv(bbox_list_path)
with open(train_txt_path, "r") as f:
    train_list = [i.strip() for i in f.readlines()]
with open(valid_txt_path, "r") as f:
    valid_list = [i.strip() for i in f.readlines()]
label_eight = list(np.unique(bbox_list["Finding Label"])) + ["No Finding"]

# Transform training images
print("training example:", len(train_list))
print("take care of your RAM here !!!")
train_X = []
for i in range(len(train_list)):
    image_path = os.path.join(image_folder_path, train_list[i])
    img = imageio.imread(image_path)
    if img.shape != (1024, 1024):  # there are some images with shape (1024,1024,4) in the training set
        img = img[:, :, 0]
    img_resized = skimage.transform.resize(img, (256, 256))  # or use img[::4] here
    train_X.append((np.array(img_resized) / 255).reshape(256, 256, 1))
    if i % 3000 == 0:
        print(i)
train_X = np.array(train_X)
np.save(os.path.join(data_path, "train_X_small.npy"), train_X)

# Transform validation images
print("validation example:", len(valid_list))
valid_X = []
for i in range(len(valid_list)):
    image_path = os.path.join(image_folder_path, valid_list[i])
    img = imageio.imread(image_path)
    if img.shape != (1024, 1024):
        img = img[:, :, 0]
    img_resized = skimage.transform.resize(img, (256, 256))
    # if img.shape != (1024,1024):
    #     train_X.append(img[:,:,0])
    # else:
    valid_X.append((np.array(img_resized) / 255).reshape(256, 256, 1))
    if i % 3000 == 0:
        print(i)
valid_X = np.array(valid_X)
np.save(os.path.join(data_path, "valid_X_small.npy"), valid_X)

# Process labels
print("label preprocessing")
train_y = []
for train_id in train_list:
    train_y.append(get_labels(train_id))
valid_y = []
for valid_id in valid_list:
    valid_y.append(get_labels(valid_id))

encoder = MultiLabelBinarizer()
encoder.fit(train_y + valid_y)
train_y_onehot = encoder.transform(train_y)
valid_y_onehot = encoder.transform(valid_y)
train_y_onehot = np.delete(train_y_onehot, [2, 3, 5, 6, 7, 10, 12], 1)  # keep only the 8 labels of interest; drop the others and "No Finding"
valid_y_onehot = np.delete(valid_y_onehot, [2, 3, 5, 6, 7, 10, 12], 1)

with open(data_path + "/train_y_onehot.pkl", "wb") as f:
    pickle.dump(train_y_onehot, f)
with open(data_path + "/valid_y_onehot.pkl", "wb") as f:
    pickle.dump(valid_y_onehot, f)
with open(data_path + "/label_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)


So this is my code. My system configuration: Intel i7-7700HQ, 16 GB RAM, 256 GB SSD, GTX 1050 4 GB.
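For context, here is a rough estimate of how much memory the final train_X array would need. I'm assuming skimage.transform.resize returns float64 (8 bytes per pixel) and roughly 78,000 training images; the exact count of my split may differ:

import numpy as np

# Rough estimate only -- the image count and dtype are assumptions, not measured values.
n_train = 78_000          # approximate number of training images (assumption)
bytes_per_pixel = 8       # skimage.transform.resize returns float64 by default
gib = n_train * 256 * 256 * 1 * bytes_per_pixel / 1024**3
print(f"train_X as float64: about {gib:.1f} GiB")   # ~38 GiB, far more than 16 GB of RAM

# Casting to float32 (4 bytes) roughly halves this, but it is still too large:
gib32 = n_train * 256 * 256 * 1 * 4 / 1024**3
print(f"train_X as float32: about {gib32:.1f} GiB")  # ~19 GiB

If I understand correctly, np.array(train_X) then needs another copy of roughly the same size on top of the Python list, so the peak usage is even higher and a MemoryError seems expected.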



Is there a way to split my dataset and write it to the same file again? I'm also posting the error I got as a screenshot: Error from PowerShell after executing the code for about 30 minutes.



I'm also using the 64-bit version of Python 3 on my system.



Would splitting the 1,12,120 images and processing them in batches work here (something like the sketch below)? If yes, how?
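One idea I'm considering (please correct me if this is wrong) is to preallocate the output .npy file on disk with numpy's open_memmap and fill it batch by batch, so only one batch of images is ever in RAM. A rough sketch of what I mean; batch_size, the float32 dtype, and the images_to_npy helper are my own choices, not part of my current code:

import os
import numpy as np
import imageio
import skimage.transform
from numpy.lib.format import open_memmap

batch_size = 1000  # arbitrary choice, just small enough to fit comfortably in RAM

def images_to_npy(file_list, out_path):
    """Resize images one batch at a time and write them straight into a .npy file on disk."""
    n = len(file_list)
    # Preallocate the full array on disk; only the parts being written are held in memory.
    out = open_memmap(out_path, mode="w+", dtype=np.float32, shape=(n, 256, 256, 1))
    for start in range(0, n, batch_size):
        batch = []
        for name in file_list[start:start + batch_size]:
            img = imageio.imread(os.path.join(image_folder_path, name))
            if img.shape != (1024, 1024):      # some images are (1024, 1024, 4)
                img = img[:, :, 0]
            img = skimage.transform.resize(img, (256, 256))
            batch.append((img / 255).reshape(256, 256, 1).astype(np.float32))
        out[start:start + len(batch)] = np.stack(batch)
        print("processed", start + len(batch), "of", n)
    out.flush()  # make sure everything is written to disk

images_to_npy(train_list, os.path.join(data_path, "train_X_small.npy"))
images_to_npy(valid_list, os.path.join(data_path, "valid_X_small.npy"))

I think the training code could then open the file with np.load(..., mmap_mode="r") and read mini-batches from it without ever loading the whole array, but I'm not sure if this is the right approach.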










python python-3.x numpy deep-learning sklearn-pandas

asked Nov 10 at 13:18, edited Nov 10 at 19:40 – Amrudesh Balakrishnan



















  • A back of the envelope calculation says 78,000 * 256 * 256 * 3 = 15.5gb, so it seems that you're running out of memory right on schedule. There are options for dealing with this but they depend on what your overall goal is. – CJ59, Nov 10 at 13:31
  • Is there anyway to overcome this ? – Amrudesh Balakrishnan, Nov 10 at 13:41
  • Fix your formatting, and post a smaller proof of concept. stackoverflow.com/help/mcve – Jim Stewart, Nov 10 at 13:44
  • This is the best I can do sir this is the same way how I wrote the code – Amrudesh Balakrishnan, Nov 10 at 13:48
  • The trivial answer is to downsample your images more or to use fewer images. You could preprocess all your data, save it to a new file and memory map that file (but it would almost certainly work terribly). The real answer is that this is approaching an HPC scale problem. – CJ59, Nov 10 at 13:50














