Problems while trying to create a new id column based on three criteria?









up vote
3
down vote

favorite
1












I have a dataframe with conversations and timestamps like this:



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...


At the moment I would like to identify with an id the conversations inside the dataframe. The problem is that a user can have several conversations (i.e. an userID can have multiple textBlob associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.



For this, I would like to create a new_id column based on three criteria:



  1. 10 minutes periods

  2. the occurrence of a keyword

  3. when a user doesnt have more textblobs

The expected output looks like this (*):



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010


So far I tried to:



searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))


And



dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times


However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?










share|improve this question



















  • 1




    Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
    – Peter Leimbigler
    Nov 11 at 16:42






  • 1




    Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
    – tumbleweed
    Nov 11 at 16:48










  • thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
    – Peter Leimbigler
    Nov 11 at 17:23














up vote
3
down vote

favorite
1












I have a dataframe with conversations and timestamps like this:



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...


At the moment I would like to identify with an id the conversations inside the dataframe. The problem is that a user can have several conversations (i.e. an userID can have multiple textBlob associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.



For this, I would like to create a new_id column based on three criteria:



  1. 10 minutes periods

  2. the occurrence of a keyword

  3. when a user doesnt have more textblobs

The expected output looks like this (*):



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010


So far I tried to:



searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))


And



dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times


However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?










share|improve this question



















  • 1




    Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
    – Peter Leimbigler
    Nov 11 at 16:42






  • 1




    Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
    – tumbleweed
    Nov 11 at 16:48










  • thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
    – Peter Leimbigler
    Nov 11 at 17:23












up vote
3
down vote

favorite
1









up vote
3
down vote

favorite
1






1





I have a dataframe with conversations and timestamps like this:



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...


At the moment I would like to identify with an id the conversations inside the dataframe. The problem is that a user can have several conversations (i.e. an userID can have multiple textBlob associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.



For this, I would like to create a new_id column based on three criteria:



  1. 10 minutes periods

  2. the occurrence of a keyword

  3. when a user doesnt have more textblobs

The expected output looks like this (*):



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010


So far I tried to:



searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))


And



dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times


However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?










share|improve this question















I have a dataframe with conversations and timestamps like this:



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...


At the moment I would like to identify with an id the conversations inside the dataframe. The problem is that a user can have several conversations (i.e. an userID can have multiple textBlob associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.



For this, I would like to create a new_id column based on three criteria:



  1. 10 minutes periods

  2. the occurrence of a keyword

  3. when a user doesnt have more textblobs

The expected output looks like this (*):



timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010


So far I tried to:



searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))


And



dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times


However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?







python pandas dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 11 at 16:51

























asked Nov 11 at 11:30









tumbleweed

91682353




91682353







  • 1




    Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
    – Peter Leimbigler
    Nov 11 at 16:42






  • 1




    Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
    – tumbleweed
    Nov 11 at 16:48










  • thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
    – Peter Leimbigler
    Nov 11 at 17:23












  • 1




    Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
    – Peter Leimbigler
    Nov 11 at 16:42






  • 1




    Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
    – tumbleweed
    Nov 11 at 16:48










  • thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
    – Peter Leimbigler
    Nov 11 at 17:23







1




1




Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42




Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42




1




1




Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48




Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48












thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23




thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23












2 Answers
2






active

oldest

votes

















up vote
1
down vote



accepted










You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:



mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10





share|improve this answer






















  • I updated the data, thanks for the help
    – tumbleweed
    Nov 11 at 16:53






  • 1




    @tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
    – Peter Leimbigler
    Nov 11 at 17:24

















up vote
0
down vote













Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:



df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
timedelta = timestamp - start_time

if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
current_id += 1
start_time = timestamp

new_id_prev = new_id
df.new_id.iloc[i] = current_id





share|improve this answer
















  • 1




    Thanks for the help!
    – tumbleweed
    Nov 11 at 17:54










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53248280%2fproblems-while-trying-to-create-a-new-id-column-based-on-three-criteria%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:



mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10





share|improve this answer






















  • I updated the data, thanks for the help
    – tumbleweed
    Nov 11 at 16:53






  • 1




    @tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
    – Peter Leimbigler
    Nov 11 at 17:24














up vote
1
down vote



accepted










You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:



mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10





share|improve this answer






















  • I updated the data, thanks for the help
    – tumbleweed
    Nov 11 at 16:53






  • 1




    @tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
    – Peter Leimbigler
    Nov 11 at 17:24












up vote
1
down vote



accepted







up vote
1
down vote



accepted






You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:



mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10





share|improve this answer














You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:



mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 11 at 17:24

























answered Nov 11 at 16:51









Peter Leimbigler

3,6941415




3,6941415











  • I updated the data, thanks for the help
    – tumbleweed
    Nov 11 at 16:53






  • 1




    @tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
    – Peter Leimbigler
    Nov 11 at 17:24
















  • I updated the data, thanks for the help
    – tumbleweed
    Nov 11 at 16:53






  • 1




    @tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
    – Peter Leimbigler
    Nov 11 at 17:24















I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53




I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53




1




1




@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24




@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24












up vote
0
down vote













Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:



df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
timedelta = timestamp - start_time

if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
current_id += 1
start_time = timestamp

new_id_prev = new_id
df.new_id.iloc[i] = current_id





share|improve this answer
















  • 1




    Thanks for the help!
    – tumbleweed
    Nov 11 at 17:54














up vote
0
down vote













Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:



df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
timedelta = timestamp - start_time

if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
current_id += 1
start_time = timestamp

new_id_prev = new_id
df.new_id.iloc[i] = current_id





share|improve this answer
















  • 1




    Thanks for the help!
    – tumbleweed
    Nov 11 at 17:54












up vote
0
down vote










up vote
0
down vote









Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:



df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
timedelta = timestamp - start_time

if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
current_id += 1
start_time = timestamp

new_id_prev = new_id
df.new_id.iloc[i] = current_id





share|improve this answer












Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:



df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
timedelta = timestamp - start_time

if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
current_id += 1
start_time = timestamp

new_id_prev = new_id
df.new_id.iloc[i] = current_id






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 11 at 17:04









Franco Piccolo

1,335611




1,335611







  • 1




    Thanks for the help!
    – tumbleweed
    Nov 11 at 17:54












  • 1




    Thanks for the help!
    – tumbleweed
    Nov 11 at 17:54







1




1




Thanks for the help!
– tumbleweed
Nov 11 at 17:54




Thanks for the help!
– tumbleweed
Nov 11 at 17:54

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53248280%2fproblems-while-trying-to-create-a-new-id-column-based-on-three-criteria%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

How to read a connectionString WITH PROVIDER in .NET Core?

In R, how to develop a multiplot heatmap.2 figure showing key labels successfully

Museum of Modern and Contemporary Art of Trento and Rovereto