Solr server keeps going down while indexing (millions of docs) using Pysolr










I've been trying to index a large number of documents in Solr (~200 million docs), using Pysolr to do the indexing. However, the Solr server keeps going down during indexing (sometimes after ~100 million documents have been indexed, sometimes after ~180 million; it varies).
I'm not sure why this is happening. Is it because of the open file limit, i.e., related to the warning I get when starting the server with bin/solr start?




* [WARN] * Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption.
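That warning is worth taking seriously: every open index segment holds several file descriptors, and at hundreds of millions of documents a 1024-descriptor limit can plausibly be exhausted. A minimal way to check and raise the limit before starting Solr (the limits.conf entries are an assumption about a typical Linux setup where Solr runs as user "solr"):

    # Show the current soft limit on open file descriptors
    ulimit -n

    # Raise it for this shell session before starting Solr.
    # This only works up to the hard limit; to make 65000 available
    # persistently, a typical Linux setup adds lines like these to
    # /etc/security/limits.conf:
    #   solr  soft  nofile  65000
    #   solr  hard  nofile  65000
    ulimit -n 65000 2>/dev/null || echo "hard limit too low; ask the sysadmin"
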




I used multiprocessing while indexing, in chunks of 25,000 records (but I also tried bigger chunks and no multiprocessing, and it still crashed). Is it because too many requests are being sent to Solr? My Python code is below.



    import csv
    import os
    import concurrent.futures
    from glob import glob

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/collection_name', always_commit=True)

    def insert_into_solr(filepath):
        """Inserts records into an empty Solr index which has already been created."""
        record_number = 0
        list_for_solr = []
        with open(filepath, "r") as file:
            csv_reader = csv.reader((line.replace('\0', '') for line in file),
                                    delimiter='\t', quoting=csv.QUOTE_NONE)
            for paper_id, paper_reference_id, context in csv_reader:
                # int, int, string
                record_number += 1
                solr_record = {}
                solr_record['paper_id'] = paper_id
                solr_record['reference_id'] = paper_reference_id
                solr_record['context'] = context
                list_for_solr.append(solr_record)
                # Send in chunks of 25000
                if record_number % 25000 == 0:
                    try:
                        solr.add(list_for_solr)
                    except Exception as e:
                        print(e, record_number, filepath)
                    list_for_solr = []
                    print(record_number)
            # Flush the remaining records
            if list_for_solr:
                try:
                    solr.add(list_for_solr)
                except Exception as e:
                    print(e, record_number, filepath)

    def create_concurrent_futures():
        """Uses all the cores to do the parsing and inserting."""
        folderpath = '.../'
        refs_files = glob(os.path.join(folderpath, '*.txt'))
        with concurrent.futures.ProcessPoolExecutor() as executor:
            executor.map(insert_into_solr, refs_files, chunksize=1)

    if __name__ == '__main__':
        create_concurrent_futures()


I read somewhere that a standard Solr installation has a hard limit of around 2.14 billion documents per index. Is it better to use SolrCloud (which I have never configured) when there are hundreds of millions of docs? Will it help with this problem? (I also have another file with 1.4 billion documents that needs to be indexed after this.) I have only one server; is there any point in trying to configure SolrCloud?
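Separately from the file-descriptor question, note that `always_commit=True` makes pysolr issue a commit on every `add` call, and commits are expensive; with several worker processes each committing every 25,000 docs, the server does a lot of avoidable work. A gentler pattern is to defer commits during bulk loading and issue one at the end. A sketch (the `batches` helper and the `records` iterable are illustrative, not the code above; `commit` and `always_commit` are real pysolr parameters):

    import itertools

    def batches(iterable, size):
        """Yield lists of up to `size` items from any iterable."""
        it = iter(iterable)
        while True:
            chunk = list(itertools.islice(it, size))
            if not chunk:
                return
            yield chunk

    # Hypothetical usage against a running Solr core:
    # solr = pysolr.Solr('http://localhost:8983/solr/collection_name',
    #                    always_commit=False, timeout=300)
    # for chunk in batches(records, 25000):
    #     solr.add(chunk, commit=False)   # defer commits during the bulk load
    # solr.commit()                        # one commit at the end
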










    An easy test is to change the ulimit and see if it helps - see File handles and processes - ulimit settings for information. Using SolrCloud is helpful when you want to spread the set of documents across multiple servers. The 2.1b limit is per shard / core, so using a collection in SolrCloud with multiple servers (even if they're running on a single machine but with different working directories) will allow you to scale that up further.

    – MatsLindh
    Nov 15 '18 at 20:20











  • Thanks @MatsLindh. I wanted to find out before asking the sysadmin to increase the ulimit. I set up Solr using the method in 'Getting started' (i.e., I just extracted it) rather than following the process on the 'Take Solr to production' page. I suppose using the default configs might also be contributing to this issue? So from what I understand, it's probably best to configure and use SolrCloud when there are this many documents, isn't it?

    – ash
    Nov 15 '18 at 21:10







    That depends. It can be - if the amount of queries or total number of documents requires it. It's also easier to scale in the future if necessary, but for prototyping and hosting something for data exploration with a small number of users, it's probably not.

    – MatsLindh
    Nov 16 '18 at 9:14











  • Thanks @MatsLindh for the advice. That's very helpful.

    – ash
    Nov 17 '18 at 2:40















python ubuntu unix solr pysolr






asked Nov 15 '18 at 19:34
– ash
