cleaning raw text files with vertical bars and converting them into csv

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
 time |total |total |total |total |total |total | % 
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
 time |total |total |total |total |total |total | % 
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

up vote
0
down vote

favorite

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
 time |total |total |total |total |total |total | % 
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.

 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
 time |total |total |total |total |total |total | % 
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:

time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%

python r pandas csv

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

edited Nov 11 at 4:22

Eray Balkanli

3,85741943

asked Nov 11 at 4:19

Hilde Schneider

asked Nov 11 at 4:19

Hilde Schneider

asked Nov 11 at 4:19

Hilde Schneider

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.4k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53245801%2fcleaning-raw-text-files-with-vertical-bars-and-converting-them-into-csv%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.4k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.4k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

up vote
0
down vote

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.4k61948

Python

If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:

data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)

answered Nov 11 at 4:38

DYZ

24.4k61948

answered Nov 11 at 4:38

DYZ

24.4k61948

answered Nov 11 at 4:38

DYZ

24.4k61948

answered Nov 11 at 4:38

DYZ

24.4k61948

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Odtnhj