Finding a bit pattern in a binary file using Python and memory map
I am processing a binary file that is not byte aligned at the start. Shortly into the file there is a 24-bit pattern, 0xfaf330, a sync marker that marks the start of the subsequent byte-aligned data. I am using Python mmap on the file and would like to use a Python memoryview, once the marker is found, to process the remaining part of the file. So, how do I find the 24-bit pattern and then use mmap and memoryview from that point forward?
python-3.x binaryfiles
Is there a reason why you mmap the file and don't just open and stream it?
– MisterMiyagi
Nov 15 '18 at 12:51
The file is very large and memory mapping helps to manage it.
– GAF
Nov 15 '18 at 12:55
Using open will only buffer a portion of the file at any time. Do you need random access? Your description sounds ideal for stream processing.
– MisterMiyagi
Nov 15 '18 at 12:57
Subsequently, memoryview helps to process the remaining byte aligned data in chunks based on the file format specification.
– GAF
Nov 15 '18 at 12:57
The data read is subject to Python's regular garbage collection. Unless you hang on to it, it is reclaimed.
– MisterMiyagi
Nov 15 '18 at 13:22
Question asked Nov 15 '18 at 12:49 by GAF, edited Nov 15 '18 at 12:52
2 Answers
If you do not need random access, you can use open to stream the file. Using file.read, you can get consecutive bytes from the file. If your file were byte-aligned, you could directly search through it:
in_stream = open('/dev/urandom', 'rb')
# discard individual bytes until the next three bytes are the marker
# (peek may return fewer bytes than requested near the end of the read buffer)
while in_stream.peek(3)[:3] != b'\xfa\xf3\x30':
    in_stream.read(1)
# in_stream is now positioned at the first byte of the marker
print(in_stream.tell())
By default, open uses a small read buffer but never loads the entire file. You can stream through the file using further in_stream.read calls.
Alternatively, you can use the result of in_stream.tell() to jump to the correct position in an mmap'ed file.
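For the byte-aligned case, the same search can also be done on the mapping itself. The following is only a minimal sketch: it assumes the marker happens to sit on a byte boundary, and the temporary file, the helper name find_aligned_marker and the sample payload are made up for the demonstration. It uses mmap.find to locate the marker and then slices a memoryview of the mapping, which does not copy the data.

```python
import mmap
import tempfile

MARKER = bytes.fromhex("faf330")  # the 24-bit sync marker from the question


def find_aligned_marker(mm):
    """Return the byte index just past the first byte-aligned marker."""
    pos = mm.find(MARKER)  # only matches on byte boundaries
    if pos == -1:
        raise ValueError("marker not found on a byte boundary")
    return pos + len(MARKER)


# Demonstration on a small temporary file (hypothetical sample data).
with tempfile.TemporaryFile() as f:
    f.write(b"\x01\x02" + MARKER + b"payload")
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = find_aligned_marker(mm)
        # Views must be released before the mapping is closed,
        # which the nested with blocks take care of.
        with memoryview(mm) as whole, whole[start:] as view:
            payload = bytes(view)  # copy only for the demonstration

print(start, payload)
```

This does not help when the marker is shifted by a few bits, because mmap.find compares whole bytes only.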
Searching non-aligned bits
To manage non-byte aligned data, you must sift through bytes manually: bit-shifting lets you inspect sub-ranges of bytes. Note that Python only allows bit-shifting int, not bytes.
>>> pattern = 0xfaf330
>>> bin((pattern << 4) + 0b1011)  # pattern shifted by 4 plus garbage
'0b1111101011110011001100001011'
You can use this to scan a window of bytes:
def find_bits(pattern: int, window: int, n: int):
    """Find an n-byte bit pattern in an n+1-byte window and return the offset"""
    for offset in range(8):
        window_slice = (window >> offset) & (2 ** (n*8) - 1)
        if pattern == window_slice:
            return offset
    raise IndexError('pattern not in window')
You can again use this to scan the file stream:
in_stream = open('/dev/urandom', 'rb')
# discard individual bytes until the 4-byte window contains the marker
while True:
    try:
        offset = find_bits(
            0xfaf330,
            int.from_bytes(in_stream.peek(4)[:4], 'big'),
            3
        )
    except IndexError:
        in_stream.read(1)
    else:
        break
# in_stream is now positioned at the first byte of the window that holds the marker
print('byte-offset:', in_stream.tell(), 'bit-offset:', offset)
Alternatively, you can convert the window to its binary string representation and literally search for the pattern in it. Note that you have to take care of zero-bit padding, so it is about the same work.
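The string-based variant could look roughly like this. It is a sketch only; the function name find_bits_str and the sample window are invented for illustration. Both the window and the pattern are formatted with an explicit width so that leading zero bits are kept, and str.find then returns the bit index counted from the start of the window.

```python
def find_bits_str(pattern: int, pattern_bits: int, data: bytes) -> int:
    """Return the bit index of pattern within data, or -1 if it is absent."""
    if not data:
        return -1
    width = len(data) * 8
    # Zero-pad to the full width, otherwise leading zero bits are dropped.
    haystack = f"{int.from_bytes(data, 'big'):0{width}b}"
    needle = f"{pattern:0{pattern_bits}b}"
    return haystack.find(needle)


# Four garbage bits, then the marker, then four more garbage bits.
window = ((0xA << 28) | (0xFAF330 << 4) | 0xB).to_bytes(4, "big")
bit_index = find_bits_str(0xFAF330, 24, window)
print(bit_index)  # 4
```

Note that the index here counts bits from the left edge of the window, whereas find_bits above returns the number of bits to the right of the pattern.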
Reading non-aligned bits
Once you have the bit-offset, you can read-and-align data from the file. Basically, read one byte more than you need, then shift as needed:
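def align_read(file, num_bytes: int, bit_offset: int):
    """Read num_bytes bytes that end bit_offset bits before the end of an n+1-byte window"""
    # bit_offset is the value returned by find_bits; with bit_offset == 0 the data
    # is the last num_bytes bytes of the window, so no special case is needed
    window = file.peek(num_bytes + 1)[:num_bytes + 1]
    file.read(num_bytes)
    data = (int.from_bytes(window, 'big') >> bit_offset) & (2 ** (num_bytes*8) - 1)
    return data.to_bytes(num_bytes, 'big')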
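Since the question asks about mmap, here is a rough sketch of the same idea on a mapping instead of a stream. Everything in it is an assumption made for illustration: the helper names find_marker_bit and realign, the use of absolute bit indices counted from the start of the file, and the generated test data. For each of the eight possible shifts, the two bytes that the marker fully determines are searched with find, and the surrounding window is then checked against a mask.

```python
import mmap
import tempfile

MARKER = 0xFAF330
MARKER_BITS = 24


def find_marker_bit(buf, start=0):
    """Return the absolute bit index of the first marker in buf, or -1.

    buf needs find() and slicing, so bytes and mmap objects both work.
    """
    best = -1
    for shift in range(8):  # marker begins `shift` bits into a byte
        n_bytes = (shift + MARKER_BITS + 7) // 8   # 3 if aligned, else 4
        tail = n_bytes * 8 - shift - MARKER_BITS   # bits after the marker
        value = MARKER << tail
        mask = ((1 << MARKER_BITS) - 1) << tail
        needle = value.to_bytes(n_bytes, "big")[1:3]  # fully determined bytes
        pos = buf.find(needle, start + 1)
        while pos != -1:
            window = buf[pos - 1:pos - 1 + n_bytes]
            if len(window) == n_bytes and (int.from_bytes(window, "big") & mask) == value:
                bit = (pos - 1) * 8 + shift
                if best == -1 or bit < best:
                    best = bit
                break
            pos = buf.find(needle, pos + 1)
    return best


def realign(buf, bit_index, num_bytes):
    """Return num_bytes bytes of buf starting at absolute bit index bit_index."""
    first, shift = divmod(bit_index, 8)
    if shift == 0:
        return bytes(buf[first:first + num_bytes])
    window = buf[first:first + num_bytes + 1]
    if len(window) < num_bytes + 1:
        raise ValueError("not enough data")
    value = int.from_bytes(window, "big") >> (8 - shift)
    return (value & ((1 << (num_bytes * 8)) - 1)).to_bytes(num_bytes, "big")


# Test data: 5 garbage bits, the marker, 7 payload bytes, 3 padding bits.
bits = ("10110" + format(MARKER, "024b")
        + "".join(format(b, "08b") for b in b"payload") + "000")
raw = int(bits, 2).to_bytes(len(bits) // 8, "big")

with tempfile.TemporaryFile() as f:
    f.write(raw)
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        marker_bit = find_marker_bit(mm)
        recovered = realign(mm, marker_bit + MARKER_BITS, 7)

print(marker_bit, recovered)
```

When the marker turns out to sit on a byte boundary, a memoryview slice of the mapping can be used without copying. For any other shift the realigned chunks are necessarily copies, because a memoryview cannot address individual bits.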
This will not work because the beginning of the file is not byte aligned. Meaning that I could read several bytes and come to the sync marker but read just a few bits of it. A subsequent read of one byte would read another misaligned part of the sync marker. Therefore, the marker could be read and not recognized. Thanks for your suggestion.
– GAF
Nov 15 '18 at 13:23
@GAF Sorry, missed that one. AFAIK Python does not support a resolution smaller than bytes - neither for open nor mmap nor other means. You will have to bit-shift each chunk.
– MisterMiyagi
Nov 15 '18 at 13:33
Thanks. Appreciate the follow up.
– GAF
Nov 15 '18 at 14:15
@GAF Added a (working) draft of how to handle the shifting to find the offset and re-align data. It is probably worth using Cython if your file is large and you read only small chunks at a time.
– MisterMiyagi
Nov 15 '18 at 14:29
MisterMiyagi's answer is a good solution. Another solution uses the bitstring module.
import bitstring

aFile = open(someFilePath, 'rb')
aBinaryStream = bitstring.ConstBitStream(aFile)
aTuple = aBinaryStream.find('0b111110101111001100110000')  # the sync marker 0xfaf330
If found, the stream position is moved to the start of the marker. From there you can read byte-aligned data, beginning with the marker itself.
aBuffer = aBinaryStream.read('bytes:1024')  # to read 1024 bytes
First answer: answered Nov 15 '18 at 13:17 by MisterMiyagi, edited Nov 15 '18 at 14:27
Second answer: answered Nov 15 '18 at 20:16 by GAF