Why is the performance different if I change the order of compile and findall in Python?
I notice that precompiling the pattern speeds up the matching operation, as in the following example.

python3 -m timeit -s "import re; t = re.compile(r'[\w+][\d]+')" "t.findall('abc eft123&aaa123')"

1000000 loops, best of 3: 1.42 usec per loop

python3 -m timeit -s "import re" "re.findall(r'[\w+][\d]+', 'abc eft123&aaa123')"

100000 loops, best of 3: 2.45 usec per loop

But if I instead pass the compiled pattern to the module-level function, the result is different: it is now much slower. Why does this happen?

python3 -m timeit -s "import re; t = re.compile(r'[\w+][\d]+')" "re.findall(t, 'abc eft123&aaa123')"

100000 loops, best of 3: 3.66 usec per loop
Tags: python, regex, compilation, findall
asked Nov 12 at 8:00 by colin-zhou, edited Nov 12 at 8:36
3 Answers
By "changing the order" you are actually calling findall in its "static" (unbound) form, much like calling str.lower('ABC') instead of 'ABC'.lower().

Depending on the exact implementation of the Python interpreter you are using, this probably adds some overhead (method lookups, for example).

In other words, this is more about how Python works in general than about regex or the re module in particular.

    from timeit import Timer

    def a():
        str.lower('ABC')

    def b():
        'ABC'.lower()

    print(min(Timer(a).repeat(5000, 5000)))
    print(min(Timer(b).repeat(5000, 5000)))

Outputs:

    0.001060427000000086   # str.lower('ABC')
    0.0008686820000001205  # 'ABC'.lower()

answered Nov 12 at 8:07 by DeepSpace, edited Nov 12 at 8:13
• Thanks for your reply.
  – colin-zhou, Nov 12 at 8:12

• I ran a test as you described, and the result suggests this may not be the root cause: python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" gives 100000000 loops, best of 3: 0.00995 usec per loop, while python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" gives 100000000 loops, best of 3: 0.00979 usec per loop
  – colin-zhou, Nov 12 at 8:18
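Whatever the exact dispatch cost turns out to be, it is easy to confirm that the two call forms are functionally equivalent. The sketch below assumes the question's regex was r'[\w+][\d]+' (backslashes are often lost when pages are rendered to text):

```python
import re

# Pattern and text mirroring the question's benchmark (the backslashes
# in the character classes are an assumption; see note above).
pattern = re.compile(r'[\w+][\d]+')
text = 'abc eft123&aaa123'

# Both call forms accept a precompiled pattern and return the same
# matches; only the dispatch path (and hence the overhead) differs.
via_method = pattern.findall(text)      # method on the compiled object
via_module = re.findall(pattern, text)  # module-level function
assert via_method == via_module
print(via_method)
```

So the slowdown is purely in how the call reaches the compiled pattern, not in what it computes.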
Let's say that word1, word2, ... are regexes. Let's rewrite those parts:

    allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all patterns:

    allWords = re.compile("|".join(["word1", "word2", "word3"]))

To support regexes that themselves contain |, you would have to parenthesize the expressions:

    allWords = re.compile("|".join("({})".format(x) for x in ["word1", "word2", "word3"]))

(That also works with plain words, of course, and it's still worth using a regex because of the | part.)

Now this is a disguised loop with each term hardcoded:

    def bar(data, allWords):
        if allWords[0].search(data) != None:
            temp = data.split("word1", 1)[1]  # BTW, that works only on non-regexes
            return(temp)
        elif allWords[1].search(data) != None:
            temp = data.split("word2", 1)[1]
            return(temp)

It can be rewritten simply as:

    def bar(data, allWords):
        return allWords.split(data, maxsplit=1)[1]

In terms of performance:

- The regular expression is compiled at start, so it's as fast as it can be.
- There's no loop or pasted expressions; the "or" part is handled by the regex engine, which is most of the time compiled code: you can't beat that in pure Python.
- The match and the split are done in one operation.
- The last hiccup is that internally the regex engine tries the alternatives in a loop, which makes this an O(n) algorithm. To make it faster, you would have to put the most frequent pattern first (my hypothesis is that the regexes are "disjoint", meaning no text can be matched by several of them; otherwise a longer pattern would have to come before a shorter one).

answered Nov 12 at 8:09 by prasanth ashok, edited Nov 12 at 9:10 by Adrian W
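The combined-alternation idea can be sketched end to end as below. The words "word1".."word3" are placeholders, and non-capturing groups (?:...) are used here because re.split also returns the contents of capturing groups, which would pollute the split result:

```python
import re

# Placeholder sub-patterns standing in for arbitrary regexes.
words = ["word1", "word2", "word3"]

# One alternation compiled once; (?:...) keeps the separator out of
# re.split's output, unlike plain capturing parentheses.
all_words = re.compile("|".join("(?:{})".format(w) for w in words))

def after_first(data):
    # Split on the first occurrence of any pattern, keep what follows.
    parts = all_words.split(data, maxsplit=1)
    return parts[1] if len(parts) > 1 else None

print(after_first("prefix word2 suffix"))
```

The single compiled alternation replaces both the if/elif chain and the per-branch split.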
I took some time to investigate the implementation of re.findall and re.match; here is the relevant standard library source:

    def findall(pattern, string, flags=0):
        """Return a list of all non-overlapping matches in the string.

        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.

        Empty matches are included in the result."""
        return _compile(pattern, flags).findall(string)


    def match(pattern, string, flags=0):
        """Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found."""
        return _compile(pattern, flags).match(string)


    def _compile(pattern, flags):
        # internal: compile pattern
        try:
            p, loc = _cache[type(pattern), pattern, flags]
            if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                return p
        except KeyError:
            pass
        if isinstance(pattern, _pattern_type):
            if flags:
                raise ValueError(
                    "cannot process flags argument with a compiled pattern")
            return pattern
        if not sre_compile.isstring(pattern):
            raise TypeError("first argument must be string or compiled pattern")
        p = sre_compile.compile(pattern, flags)
        if not (flags & DEBUG):
            if len(_cache) >= _MAXCACHE:
                _cache.clear()
            if p.flags & LOCALE:
                if not _locale:
                    return p
                loc = _locale.setlocale(_locale.LC_CTYPE)
            else:
                loc = None
            _cache[type(pattern), pattern, flags] = p, loc
        return p

This shows that executing re.findall(compiled_pattern, string) triggers an extra call to _compile(pattern, flags), which performs some type checks and looks the pattern up in the cache dictionary. If we call compiled_pattern.findall(string) instead, that extra work is skipped entirely. So compiled_pattern.findall(string) is faster than re.findall(compiled_pattern, string).
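The extra _compile dispatch can also be measured in-script with the timeit module rather than from the shell. This is a rough reproduction of the question's benchmark; the pattern again assumes the question's regex had backslashes, and absolute timings vary by machine and Python version, so only the ordering is meaningful:

```python
import re
import timeit

# Shared setup: a precompiled pattern and the sample text from the question.
setup = "import re; t = re.compile(r'[\\w+][\\d]+'); s = 'abc eft123&aaa123'"

# Method call on the compiled object vs. module-level call that re-enters
# _compile (type checks + cache lookup) on every invocation.
method_call = timeit.timeit("t.findall(s)", setup=setup, number=100000)
module_call = timeit.timeit("re.findall(t, s)", setup=setup, number=100000)

# Typically method_call < module_call, matching the question's numbers.
print(method_call, module_call)
```

On most builds the module-level form comes out measurably slower, consistent with the _compile overhead identified above.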






            share|improve this answer




















              Your Answer






              StackExchange.ifUsing("editor", function ()
              StackExchange.using("externalEditor", function ()
              StackExchange.using("snippets", function ()
              StackExchange.snippets.init();
              );
              );
              , "code-snippets");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "1"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













              draft saved

              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53257939%2fwhy-the-performance-different-if-change-the-order-of-compile-and-findall-in-pyth%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1














              By "changing the order" you are actually using findall in its "static" form, pretty much the equivallent of calling str.lower('ABC') instead of 'ABC'.lower().



              Depending on the exact implementation of the Python interpreter you are using, this is probably causing some overhead (for method lookups for example).



              In other words, this is more related to the way Python works and not specifically to regex or the re module in particular.



              from timeit import Timer

              def a():
              str.lower('ABC')

              def b():
              'ABC'.lower()

              print(min(Timer(a).repeat(5000, 5000)))
              print(min(Timer(b).repeat(5000, 5000)))


              Outputs



              0.001060427000000086 # str.lower('ABC')
              0.0008686820000001205 # 'ABC'.lower()





              share|improve this answer






















              • Thanks for your reply.
                – colin-zhou
                Nov 12 at 8:12










              • I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
                – colin-zhou
                Nov 12 at 8:18















              1














              By "changing the order" you are actually using findall in its "static" form, pretty much the equivallent of calling str.lower('ABC') instead of 'ABC'.lower().



              Depending on the exact implementation of the Python interpreter you are using, this is probably causing some overhead (for method lookups for example).



              In other words, this is more related to the way Python works and not specifically to regex or the re module in particular.



              from timeit import Timer

              def a():
              str.lower('ABC')

              def b():
              'ABC'.lower()

              print(min(Timer(a).repeat(5000, 5000)))
              print(min(Timer(b).repeat(5000, 5000)))


              Outputs



              0.001060427000000086 # str.lower('ABC')
              0.0008686820000001205 # 'ABC'.lower()





              share|improve this answer






















              • Thanks for your reply.
                – colin-zhou
                Nov 12 at 8:12










              • I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
                – colin-zhou
                Nov 12 at 8:18













              1












              1








              1






              By "changing the order" you are actually using findall in its "static" form, pretty much the equivallent of calling str.lower('ABC') instead of 'ABC'.lower().



              Depending on the exact implementation of the Python interpreter you are using, this is probably causing some overhead (for method lookups for example).



              In other words, this is more related to the way Python works and not specifically to regex or the re module in particular.



              from timeit import Timer

              def a():
              str.lower('ABC')

              def b():
              'ABC'.lower()

              print(min(Timer(a).repeat(5000, 5000)))
              print(min(Timer(b).repeat(5000, 5000)))


              Outputs



              0.001060427000000086 # str.lower('ABC')
              0.0008686820000001205 # 'ABC'.lower()





              share|improve this answer














              By "changing the order" you are actually using findall in its "static" form, pretty much the equivallent of calling str.lower('ABC') instead of 'ABC'.lower().



              Depending on the exact implementation of the Python interpreter you are using, this is probably causing some overhead (for method lookups for example).



              In other words, this is more related to the way Python works and not specifically to regex or the re module in particular.



              from timeit import Timer

              def a():
              str.lower('ABC')

              def b():
              'ABC'.lower()

              print(min(Timer(a).repeat(5000, 5000)))
              print(min(Timer(b).repeat(5000, 5000)))


              Outputs



              0.001060427000000086 # str.lower('ABC')
              0.0008686820000001205 # 'ABC'.lower()






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Nov 12 at 8:13

























              answered Nov 12 at 8:07









              DeepSpace

              36.2k44168




              36.2k44168











              • Thanks for your reply.
                – colin-zhou
                Nov 12 at 8:12










              • I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
                – colin-zhou
                Nov 12 at 8:18
















              • Thanks for your reply.
                – colin-zhou
                Nov 12 at 8:12










              • I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
                – colin-zhou
                Nov 12 at 8:18















              Thanks for your reply.
              – colin-zhou
              Nov 12 at 8:12




              Thanks for your reply.
              – colin-zhou
              Nov 12 at 8:12












              I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
              – colin-zhou
              Nov 12 at 8:18




              I did a test as you described, the result shows it may not root cause. python3 -m timeit -s "'AAAAAAAAAAAAABBBBBBBBBBBBBBBA'.lower()" 100000000 loops, best of 3: 0.00995 usec per loop python3 -m timeit -s "str.lower('AAAAAAAAAAAAABBBBBBBBBBBBBBBA')" 100000000 loops, best of 3: 0.00979 usec per loop
              – colin-zhou
              Nov 12 at 8:18













              0














              Let's say that word1, word2 ... are regexes:



              let's rewrite those parts:



              allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]


              I would create one single regex for all patterns:



              allWords = re.compile("|".join(["word1", "word2", "word3"])


              To support regexes with | in them, you would have to parenthesize the expressions:



              allWords = re.compile("|".join("()".format(x) for x in ["word1", "word2", "word3"])


              (that also works with standard words of course, and it's still worth using regexes because of the | part)



              now this is a disguised loop with each term hardcoded:



              def bar(data, allWords):
              if allWords[0].search(data) != None:
              temp = data.split("word1", 1)[1] # that works only on non-regexes BTW
              return(temp)

              elif allWords[1].search(data) != None:
              temp = data.split("word2", 1)[1]
              return(temp)


              can be rewritten simply as



              def bar(data, allWords):
              return allWords.split(data,maxsplit=1)[1]


              in terms of performance:



              Regular expression is compiled at start, so it's as fast as it can be
              there's no loop or pasted expressions, the "or" part is done by the regex engine, which is most of the time some compiled code: can't beat that in pure python.
              The match & the split are done in one operation
              The last hiccup is that internally the regex engine searches for all expressions in a loop, which makes that a O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent, and put it first (my hypothesis is that regexes are "disjoint", which means that a text cannot be matched by several ones, else the longest would have to come before the shorter one)






              share|improve this answer



























                0














                Let's say that word1, word2 ... are regexes:



                let's rewrite those parts:



                allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]


                I would create one single regex for all patterns:



                allWords = re.compile("|".join(["word1", "word2", "word3"])


                To support regexes with | in them, you would have to parenthesize the expressions:



                allWords = re.compile("|".join("()".format(x) for x in ["word1", "word2", "word3"])


                (that also works with standard words of course, and it's still worth using regexes because of the | part)



                now this is a disguised loop with each term hardcoded:



                def bar(data, allWords):
                if allWords[0].search(data) != None:
                temp = data.split("word1", 1)[1] # that works only on non-regexes BTW
                return(temp)

                elif allWords[1].search(data) != None:
                temp = data.split("word2", 1)[1]
                return(temp)


                can be rewritten simply as



                def bar(data, allWords):
                return allWords.split(data,maxsplit=1)[1]


                in terms of performance:



                Regular expression is compiled at start, so it's as fast as it can be
                there's no loop or pasted expressions, the "or" part is done by the regex engine, which is most of the time some compiled code: can't beat that in pure python.
                The match & the split are done in one operation
                The last hiccup is that internally the regex engine searches for all expressions in a loop, which makes that a O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent, and put it first (my hypothesis is that regexes are "disjoint", which means that a text cannot be matched by several ones, else the longest would have to come before the shorter one)






                share|improve this answer

























                  0












                  0








                  0






                  Let's say that word1, word2 ... are regexes:



                  let's rewrite those parts:



                  allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]


                  I would create one single regex for all patterns:



                  allWords = re.compile("|".join(["word1", "word2", "word3"])


                  To support regexes with | in them, you would have to parenthesize the expressions:



                  allWords = re.compile("|".join("()".format(x) for x in ["word1", "word2", "word3"])


                  (that also works with standard words of course, and it's still worth using regexes because of the | part)



                  now this is a disguised loop with each term hardcoded:



                  def bar(data, allWords):
                  if allWords[0].search(data) != None:
                  temp = data.split("word1", 1)[1] # that works only on non-regexes BTW
                  return(temp)

                  elif allWords[1].search(data) != None:
                  temp = data.split("word2", 1)[1]
                  return(temp)


                  can be rewritten simply as



                  def bar(data, allWords):
                  return allWords.split(data,maxsplit=1)[1]


                  in terms of performance:



                  Regular expression is compiled at start, so it's as fast as it can be
                  there's no loop or pasted expressions, the "or" part is done by the regex engine, which is most of the time some compiled code: can't beat that in pure python.
                  The match & the split are done in one operation
                  The last hiccup is that internally the regex engine searches for all expressions in a loop, which makes that a O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent, and put it first (my hypothesis is that regexes are "disjoint", which means that a text cannot be matched by several ones, else the longest would have to come before the shorter one)






                  share|improve this answer














                  Let's say that word1, word2 ... are regexes:



                  let's rewrite those parts:



                  allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]


                  I would create one single regex for all patterns:



                  allWords = re.compile("|".join(["word1", "word2", "word3"])


                  To support regexes with | in them, you would have to parenthesize the expressions:



                  allWords = re.compile("|".join("()".format(x) for x in ["word1", "word2", "word3"])


                  (that also works with standard words of course, and it's still worth using regexes because of the | part)



                  now this is a disguised loop with each term hardcoded:



                  def bar(data, allWords):
                  if allWords[0].search(data) != None:
                  temp = data.split("word1", 1)[1] # that works only on non-regexes BTW
                  return(temp)

                  elif allWords[1].search(data) != None:
                  temp = data.split("word2", 1)[1]
                  return(temp)


                  can be rewritten simply as



                  def bar(data, allWords):
                  return allWords.split(data,maxsplit=1)[1]


                  in terms of performance:



                  Regular expression is compiled at start, so it's as fast as it can be
                  there's no loop or pasted expressions, the "or" part is done by the regex engine, which is most of the time some compiled code: can't beat that in pure python.
                  The match & the split are done in one operation
                  The last hiccup is that internally the regex engine searches for all expressions in a loop, which makes that a O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent, and put it first (my hypothesis is that regexes are "disjoint", which means that a text cannot be matched by several ones, else the longest would have to come before the shorter one)







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 12 at 9:10









                  Adrian W

                  1,77131320




                  1,77131320










                  answered Nov 12 at 8:09









                  prasanth ashok

                  1




                  1





















                      0














                      I took some time to investigate the realization of re.findall and re.match, and I copied the standard library source code here.



                      def findall(pattern, string, flags=0):
                      """Return a list of all non-overlapping matches in the string.

                      If one or more capturing groups are present in the pattern, return
                      a list of groups; this will be a list of tuples if the pattern
                      has more than one group.

                      Empty matches are included in the result."""
                      return _compile(pattern, flags).findall(string)


                      def match(pattern, string, flags=0):
                      """Try to apply the pattern at the start of the string, returning
                      a match object, or None if no match was found."""
                      return _compile(pattern, flags).match(string)


                      def _compile(pattern, flags):
                      # internal: compile pattern
                      try:
                      p, loc = _cache[type(pattern), pattern, flags]
                      if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                      return p
                      except KeyError:
                      pass
                      if isinstance(pattern, _pattern_type):
                      if flags:
                      raise ValueError(
                      "cannot process flags argument with a compiled pattern")
                      return pattern
                      if not sre_compile.isstring(pattern):
                      raise TypeError("first argument must be string or compiled pattern")
                      p = sre_compile.compile(pattern, flags)
                      if not (flags & DEBUG):
                      if len(_cache) >= _MAXCACHE:
                      _cache.clear()
                      if p.flags & LOCALE:
                      if not _locale:
                      return p
                      loc = _locale.setlocale(_locale.LC_CTYPE)
                      else:
                      loc = None
                      _cache[type(pattern), pattern, flags] = p, loc
                      return p


                      This shows that if we execute re.findall(compiled_pattern, string) directly, it will trigger an additional calling of _compile(pattern, flags), in which function it will do some check and search the pattern in cache dictionary. However, if we call compile_pattern.findall(string) instead, that 'additional operation' wouldn't exist. So compile_pattern.findall(string) will faster than re.findall(compile_pattern, string)






                      share|improve this answer

























                        0














                        I took some time to investigate the realization of re.findall and re.match, and I copied the standard library source code here.



def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)


def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)


def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p


This shows that executing re.findall(compiled_pattern, string) triggers an additional call to _compile(pattern, flags). Worse, when the pattern is already compiled, the cache lookup at the top of _compile always fails: only string patterns are ever stored in _cache, so the lookup raises KeyError, and only after the exception is swallowed does the isinstance check return the pattern unchanged. Calling compiled_pattern.findall(string) skips this dispatch entirely, which is why it is the fastest form; and the failed-lookup-plus-exception path explains why re.findall(compiled_pattern, string) can even be slower than re.findall(pattern_string, string), where the cache lookup succeeds.
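The three call styles can be timed from inside Python rather than on the command line. Below is a minimal sketch; the pattern r'\w+\d+' is an assumption (the backslashes in the original post's pattern appear to have been lost in formatting):

```python
import re
import timeit

t = re.compile(r'\w+\d+')  # assumed pattern; stands in for the original
text = 'abc eft123&aaa123'

# All three call styles find the same matches ...
assert t.findall(text) == re.findall(t, text) == re.findall(r'\w+\d+', text)

# ... but each pays a different dispatch cost per call:
method_call = timeit.timeit(lambda: t.findall(text), number=100_000)
string_arg = timeit.timeit(lambda: re.findall(r'\w+\d+', text), number=100_000)
compiled_arg = timeit.timeit(lambda: re.findall(t, text), number=100_000)

# t.findall(text) skips _compile entirely; re.findall with a string
# pattern gets a cache hit; re.findall with a compiled pattern suffers a
# KeyError in the cache lookup before _compile returns it unchanged.
print(f"method call:  {method_call:.3f}s")
print(f"string arg:   {string_arg:.3f}s")
print(f"compiled arg: {compiled_arg:.3f}s")
```

Exact numbers will vary by machine and Python version, but the ordering should match the timeit results in the question.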






                          answered Nov 12 at 18:26









                          colin-zhou
