A pythonic and uFunc-y way to turn pandas column into “increasing” index?

Let's say I have a pandas df like so:

Index   A     B
0      foo    3
1      foo    2
2      foo    5
3      bar    3
4      bar    4
5      baz    5

What's a good fast way to add a column like so:

Index   A     B    Aidx
0      foo    3    0
1      foo    2    0
2      foo    5    0
3      bar    3    1
4      bar    4    1
5      baz    5    2

I.e. adding an increasing index for each unique value?

I know I could use df.A.unique(), then build a lookup dict with enumerate, and then map that dict over the column to create the new one. But I feel like there should be a faster way, possibly involving groupby with some special function?
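For reference, the lookup approach described above could look roughly like this. It is only a sketch: the frame is rebuilt by hand from the example table, and the name lookup is an illustrative choice, not something from the question.

```python
import pandas as pd

# Rebuild the example frame from the question (values copied from the table).
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# Series.unique() returns values in order of first appearance,
# so enumerate() numbers them foo=0, bar=1, baz=2.
lookup = {value: idx for idx, value in enumerate(df["A"].unique())}

# Map every row's value of A to its number.
df["Aidx"] = df["A"].map(lookup)

print(df["Aidx"].tolist())  # [0, 0, 0, 1, 1, 2]
```

This serves as a baseline that the answers below can be checked against.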

asked 2 hours ago

Lagerbaer

python pandas


3 Answers

One way is to use ngroup. By default groupby sorts the group keys, which would number the groups in sorted order (bar, baz, foo) instead of in order of first appearance, so pass sort=False to get your desired output:

df['Aidx'] = df.groupby('A',sort=False).ngroup()

>>> df
   Index    A  B  Aidx
0      0  foo  3     0
1      1  foo  2     0
2      2  foo  5     0
3      3  bar  3     1
4      4  bar  4     1
5      5  baz  5     2
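To see why sort=False matters, here is a small sketch comparing both settings on the example data. The frame is rebuilt by hand, and the variable names by_appearance and by_sorted_key are illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# sort=False: groups are numbered in order of first appearance (foo, bar, baz).
by_appearance = df.groupby("A", sort=False).ngroup()

# Default sort=True: groups are numbered in sorted key order (bar, baz, foo).
by_sorted_key = df.groupby("A").ngroup()

print(by_appearance.tolist())  # [0, 0, 0, 1, 1, 2]
print(by_sorted_key.tolist())  # [2, 2, 2, 0, 0, 1]
```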

edited 1 hour ago

answered 2 hours ago

sacul


There is no need for groupby. Here are a few alternatives.

Method 1: factorize

pd.factorize(df.A)[0]

array([0, 0, 0, 1, 1, 2], dtype=int64)

# df['Aidx'] = pd.factorize(df.A)[0]

Method 2: sklearn

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

le.fit(df.A)

LabelEncoder()

le.transform(df.A)

array([2, 2, 2, 0, 0, 1])

Method 3: cat.codes

df.A.astype('category').cat.codes

Method 4: map + unique

l = df.A.unique()

df.A.map(dict(zip(l, range(len(l)))))

0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int64

Method 5: np.unique

import numpy as np

x, y = np.unique(df.A.values, return_inverse=True)

y

array([2, 2, 2, 0, 0, 1], dtype=int64)
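As the outputs above suggest, the methods do not all number the labels the same way: factorize follows order of first appearance, while cat.codes and np.unique follow sorted order (bar=0, baz=1, foo=2). The sketch below shows the difference and one possible way to renumber sorted codes by first appearance. The variable names are illustrative, and sklearn is left out to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar", "baz"]})

# Numbered by first appearance: foo=0, bar=1, baz=2.
first_seen = pd.factorize(df["A"])[0]

# Numbered by sorted label: bar=0, baz=1, foo=2.
sorted_codes = df["A"].astype("category").cat.codes.to_numpy()
_, inverse = np.unique(df["A"].to_numpy(), return_inverse=True)

print(first_seen.tolist())    # [0, 0, 0, 1, 1, 2]
print(sorted_codes.tolist())  # [2, 2, 2, 0, 0, 1]
print(inverse.tolist())       # [2, 2, 2, 0, 0, 1]

# Factorizing the sorted codes once more renumbers them by first appearance.
renumbered = pd.factorize(sorted_codes)[0]
print(renumbered.tolist())    # [0, 0, 0, 1, 1, 2]
```

So if the numbering has to match the Aidx column from the question exactly, Methods 1 and 4 give it directly, and the others need a renumbering step like the last one shown.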

edited 35 mins ago

answered 1 hour ago

W-B


Good solutions, they should be really fast as well. Maybe add a timing comparison, as the OP is looking for the most efficient solution.
– Vaishali
45 mins ago

@Vaishali Sorry, it is hard for me to get the timings. Would you mind adding them for me? Thanks a lot.
– W-B
33 mins ago
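A timing comparison along the lines requested in the comments could be set up roughly as follows. This is only a sketch under assumptions that are not from the thread: the synthetic frame, its size, the four labels, and the repeat count are arbitrary, and actual numbers will vary with machine, pandas version, and data, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumption): 100,000 rows drawn from four labels.
rng = np.random.default_rng(0)
labels = np.array(["foo", "bar", "baz", "qux"])
df = pd.DataFrame({"A": labels[rng.integers(0, len(labels), size=100_000)]})

# Candidates that all number values in order of first appearance.
candidates = {
    "ngroup": lambda: df.groupby("A", sort=False).ngroup().to_numpy(),
    "factorize": lambda: pd.factorize(df["A"])[0],
    "map + unique": lambda: df["A"].map(
        {value: idx for idx, value in enumerate(df["A"].unique())}
    ).to_numpy(),
}

reference = candidates["factorize"]()
repeats = 5
timings = {}
for name, func in candidates.items():
    # Check that every candidate agrees before timing it.
    assert np.array_equal(func(), reference)
    timings[name] = timeit.timeit(func, number=repeats) / repeats
    print(f"{name}: {timings[name] * 1e3:.2f} ms per call")
```

Checking the candidates against each other first avoids comparing the speed of methods that do not actually produce the same numbering.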


One more way to do it:

df['C'] = df.A.ne(df.A.shift()).cumsum() - 1

df

Printing df gives:

  Index  A    B  C
0  0     foo  3  0
1  1     foo  2  0
2  2     foo  5  0
3  3     bar  3  1
4  4     bar  4  1
5  5     baz  5  2

Explanation of the solution: let's break it into parts.

1st step: Compare column A with a copy of itself shifted down by one row:

df.A.ne(df.A.shift())

The output is:

0     True
1    False
2    False
3     True
4    False
5     True

2nd step: Apply cumsum(). Every True (a row whose value of A differs from the previous row's value) increases the running total by one, and subtracting 1 makes the numbering start at 0:

df.A.ne(df.A.shift()).cumsum() - 1

0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int32

3rd step: Assign the result to df['C'], which creates a new column named C in df.
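One thing to keep in mind with the shift/cumsum approach is that it numbers consecutive runs rather than unique values. On the example data the two coincide because equal values sit next to each other. The sketch below uses a made-up series in which foo reappears after bar to show where they differ; the names run_ids and value_ids are illustrative.

```python
import pandas as pd

# Hypothetical data where "foo" reappears after "bar".
s = pd.Series(["foo", "foo", "bar", "foo"], name="A")

# shift/cumsum: a new number starts whenever the value changes.
run_ids = s.ne(s.shift()).cumsum() - 1

# factorize: one number per unique value, in order of first appearance.
value_ids = pd.factorize(s)[0]

print(run_ids.tolist())    # [0, 0, 1, 2]
print(value_ids.tolist())  # [0, 0, 1, 0]
```

If equal values are not guaranteed to be contiguous, sorting by A first or using one of the value-based methods avoids the discrepancy.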

edited 1 hour ago

answered 1 hour ago

RavinderSingh13


Nice method, +1 from me.
– W-B
1 hour ago

@W-B, thank you for the encouragement, sir. Already +1 for your unique style :)
– RavinderSingh13
1 hour ago
