Data manipulation based on trends value

Given a dataset with a Date column and a Value column, I need to find the best way to segment the data by date, based on trends in the Value column. My output should be a CSV file with the columns StartDate, EndDate, StartValue, EndValue. The start and end dates define the bounds of each segment.
A short example is presented. Input data:

| Date | Value |
|---|---|
| 01/01/2014 | 10 |
| 01/02/2014 | 5 |
| 01/03/2014 | 5 |
| 01/04/2014 | 0 |

Output:

| StartDate | EndDate | StartValue | EndValue |
|---|---|---|---|
| 01/01/2014 | 01/15/2014 | 10 | 5 |
| 01/16/2014 | 02/03/2014 | 5 | 5 |
| 02/04/2014 | 03/10/2014 | 5 | 4 |

asked Nov 13 '18 at 23:15

123josh123

python-3.x data-mining data-science data-manipulation

1 Answer

Here is an approach using pandas.DataFrame.shift (see the pandas documentation).

First, I'll create a dataframe with some random data:

import numpy as np
import pandas as pd

# 100 consecutive daily dates with random integer values from 1 to 4
datelist = pd.date_range('1/1/2019', periods=100).tolist()
values = np.random.randint(1, 5, 100)
df = pd.DataFrame({'Date': datelist, 'Value': values})
df = df.set_index('Date')
df.head(10)

Date Value
2019-01-01 1
2019-01-02 4
2019-01-03 2
2019-01-04 2
2019-01-05 2
2019-01-06 3
2019-01-07 2
2019-01-08 2
2019-01-09 3
2019-01-10 2

Drop consecutive duplicate rows, keeping only the first row of each run of equal values:

df = df.loc[df.Value.shift() != df.Value]

Date Value
2019-01-01 1
2019-01-02 4
2019-01-03 2
2019-01-06 3
2019-01-07 2
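
To see why this comparison keeps only the first row of each run, here is a minimal sketch on the four values from the question. The names `s` and `mask` are illustrative only and are not part of the steps above.

```python
import pandas as pd

# The four sample values from the question.
s = pd.Series([10, 5, 5, 0], name='Value')

# shift() moves every value down one row, so each row is compared with its
# predecessor. The first row is compared with NaN, which never compares
# equal, so the first row is always kept.
mask = s.shift() != s

print(mask.tolist())     # [True, True, False, True]
print(s[mask].tolist())  # [10, 5, 0]
```

The second 5 is dropped because it equals the value directly before it; a 5 that reappeared later, after a different value, would be kept.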

Reset the index (if the Date column is the index in the original data):

df = df.reset_index()

Rename the existing columns to be the start columns.

df.columns = ['Start_Date', 'Start_Value']

Create the end columns by shifting the start columns up one row, so that each row's end is the next row's start:

df['End_Date'] = df.Start_Date.shift(-1)
df['End_Value'] = df.Start_Value.shift(-1)

Drop the rows containing NaN (only the final row of the dataframe, because of the shift(-1)):

df = df.dropna()

Set the End_Value type to int (if preferred).

df['End_Value'] = df['End_Value'].astype(int)
df.head(10)

 Start_Date Start_Value End_Date End_Value
0 2019-01-01 1 2019-01-02 4
1 2019-01-02 4 2019-01-03 2
2 2019-01-03 2 2019-01-06 3
3 2019-01-06 3 2019-01-07 2
4 2019-01-07 2 2019-01-09 3
5 2019-01-09 3 2019-01-10 2
6 2019-01-10 2 2019-01-11 1
7 2019-01-11 1 2019-01-12 2
8 2019-01-12 2 2019-01-15 1
9 2019-01-15 1 2019-01-16 4
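
Putting the steps together, the following is a sketch of the whole transformation as a single function, run on the sample input from the question. The function name `segment_by_value_change` is hypothetical, and the output columns use the header from the question rather than the underscored names above. As in the steps above, each segment ends on the date where the next value begins, so this may not reproduce the exact end dates shown in the question's expected output.

```python
import pandas as pd

def segment_by_value_change(df):
    """Collapse runs of equal values and pair each run's start with the next run's start."""
    # Keep the first row of each run of identical consecutive values.
    starts = df.loc[df['Value'].shift() != df['Value']].reset_index(drop=True)
    out = pd.DataFrame({
        'StartDate': starts['Date'],
        'EndDate': starts['Date'].shift(-1),
        'StartValue': starts['Value'],
        'EndValue': starts['Value'].shift(-1),
    })
    # The last run has no successor, so its end columns are missing; drop it.
    out = out.dropna().reset_index(drop=True)
    # shift(-1) introduced NaN, which turned the column into floats.
    out['EndValue'] = out['EndValue'].astype(int)
    return out

# Sample input from the question, with Date as an ordinary column.
sample = pd.DataFrame({
    'Date': pd.to_datetime(
        ['01/01/2014', '01/02/2014', '01/03/2014', '01/04/2014'],
        format='%m/%d/%Y',
    ),
    'Value': [10, 5, 5, 0],
})
segments = segment_by_value_change(sample)
print(segments)
```

For this input the result has two rows: one going from 10 to 5 and one going from 5 to 0.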

Create a CSV file from the dataframe:

df.to_csv('trends.csv')
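
By default `to_csv` also writes the dataframe's integer index as an unnamed first column. If the file should contain only the four columns named in the question, a sketch like the following may help. The two rows here are made-up illustration data, and the rename mapping assumes the underscored column names used above.

```python
import io
import pandas as pd

# Made-up rows in the same shape as the dataframe built above.
df = pd.DataFrame({
    'Start_Date': pd.to_datetime(['2019-01-01', '2019-01-02']),
    'Start_Value': [1, 4],
    'End_Date': pd.to_datetime(['2019-01-02', '2019-01-03']),
    'End_Value': [4, 2],
})

# Rename to the header the question asks for and fix the column order.
out = df.rename(columns={
    'Start_Date': 'StartDate', 'End_Date': 'EndDate',
    'Start_Value': 'StartValue', 'End_Value': 'EndValue',
})[['StartDate', 'EndDate', 'StartValue', 'EndValue']]

# index=False leaves out the row index. A text buffer stands in for a file
# path here so the result can be inspected directly.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a path such as 'trends.csv' instead of the buffer writes the same content to disk.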

edited Jan 5 at 10:05

answered Jan 4 at 14:56

Chris
