cleaning raw text files with vertical bars and converting them into csv









up vote
0
down vote

favorite












I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.



 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%


I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:



time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%









share|improve this question























  • If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
    – r2evans
    Nov 11 at 4:21















up vote
0
down vote

favorite












I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.



 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%


I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:



time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%









share|improve this question























  • If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
    – r2evans
    Nov 11 at 4:21













up vote
0
down vote

favorite









up vote
0
down vote

favorite











I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.



 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%


I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:



time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%









share|improve this question















I have a huge raw dataset( 4k lines each text file) with a lot of vertical bars and spaces.



 |group call| pvt call |phone call|group busy| pvt busy |phone busy| 
time |total |total |total |total |total |total | %
period| sec cnt | sec cnt| sec cnt| sec cnt| sec cnt| sec cnt | usage

00:00 | 4323 548| 0 0| 0 0| 0 0| 0 0| 0 0| 18%

00:15 | 4125 479| 0 0| 0 0| 0 0| 0 0| 0 0| 17%

00:30 | 3071 395| 0 0| 0 0| 0 0| 0 0| 0 0| 13%

00:45 | 3514 447| 0 0| 0 0| 0 0| 0 0| 0 0| 14%

01:00 | 3081 383| 0 0| 0 0| 0 0| 0 0| 0 0| 13%


I want to convert it into a csv file.
The parser that I built using python and pandas only reads csv values. How can I do so?
The csv file should look something like:



time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec, pvt_busy_t_c,phone_busy_t_sec, phone_busy_t_c, per_usage
00:00,4323,548,0,0,0,0,0,0,0,0,0,0,18%
00:15,4125,479,0,0,0,0,0,0,0,0,0,0,17%
00:30,3071,395,0,0,0,0,0,0,0,0,0,0,13%
00:45,3514,447,0,0,0,0,0,0,0,0,0,0,14%
01:00,3081,383,0,0,0,0,0,0,0,0,0,0,13%
01:15,4017,470,0,0,0,0,0,0,0,0,0,0,18%
01:30,4767,555,0,0,0,0,0,0,0,0,0,0,18%






python r pandas csv






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 11 at 4:22









Eray Balkanli

3,85741943




3,85741943










asked Nov 11 at 4:19









Hilde Schneider

53




53











  • If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
    – r2evans
    Nov 11 at 4:21

















  • If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
    – r2evans
    Nov 11 at 4:21
















If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21





If you can guarantee that there will not be embedded spaces in any of the fields, then try gsub('[ |]+', ',', readLines(filename)) for basic sep= conversion. Your column naming is a bit more work, though.
– r2evans
Nov 11 at 4:21













1 Answer
1






active

oldest

votes

















up vote
0
down vote













Python



If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:



data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)





share|improve this answer




















  • I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
    – Hilde Schneider
    Nov 11 at 8:13










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53245801%2fcleaning-raw-text-files-with-vertical-bars-and-converting-them-into-csv%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
0
down vote













Python



If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:



data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)





share|improve this answer




















  • I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
    – Hilde Schneider
    Nov 11 at 8:13














up vote
0
down vote













Python



If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:



data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)





share|improve this answer




















  • I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
    – Hilde Schneider
    Nov 11 at 8:13












up vote
0
down vote










up vote
0
down vote









Python



If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:



data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)





share|improve this answer












Python



If all files have the same header structure, you can read the data part, assign the headers, and then save to CSV:



data = pd.read_csv("file1.txt", sep=r's*|?s*', header=None, skiprows=3)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13
#0 00:00 4323 548 0 0 0 0 0 0 0 0 0 0 18%
#1 00:15 4125 479 0 0 0 0 0 0 0 0 0 0 17%
#2 00:30 3071 395 0 0 0 0 0 0 0 0 0 0 13%
#3 00:45 3514 447 0 0 0 0 0 0 0 0 0 0 14%
#4 01:00 3081 383 0 0 0 0 0 0 0 0 0 0 13%
data.columns = "time_pd","group_call_t_s","group_call_t_c",...
data.to_csv("file1.csv", index=None)






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 11 at 4:38









DYZ

24.4k61948




24.4k61948











  • I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
    – Hilde Schneider
    Nov 11 at 8:13
















  • I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
    – Hilde Schneider
    Nov 11 at 8:13















I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13




I always get this error. TypeError: Index(...) must be called with a collection of some kind, 'time_pd,group_call_t_s,group_call_t_c,pvt_call_t_sec,pvt_call_t_c,phone_call_t_sec,phone_call_t_c,group_busy_t_sec,group_busy_t_c,pvt_busy_t_sec,pvt_busy_t_c,phone_busy_t_sec,phone_busy_t_c,per_usage' was passed
– Hilde Schneider
Nov 11 at 8:13

















 

draft saved


draft discarded















































 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53245801%2fcleaning-raw-text-files-with-vertical-bars-and-converting-them-into-csv%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

How to read a connectionString WITH PROVIDER in .NET Core?

Node.js Script on GitHub Pages or Amazon S3

Museum of Modern and Contemporary Art of Trento and Rovereto