따릉이
[Machine Learning] Pandas Basics
0. Pandas, DataFrame, Series
In [ ]:
import pandas as pd
data_frame = pd.read_csv("test.csv")  # read a CSV file into a DataFrame
In [15]:
data_frame
Out[15]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Julia | 30 | teacher |
2 | Brian | 45 | manager |
3 | Chris | 25 | intern |
In [17]:
data_frame.head(2)  # first 2 rows of the DataFrame
Out[17]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Julia | 30 | teacher |
In [19]:
data_frame.tail(2)  # last 2 rows of the DataFrame
Out[19]:
| | name | age | job |
---|---|---|---|
2 | Brian | 45 | manager |
3 | Chris | 25 | intern |
In [209]:
type(data_frame.job)  # each column of a DataFrame is a Series
Out[209]:
pandas.core.series.Series
In [28]:
s1 = pd.Series([1, 2, 3])  # pd.Series is the public name of pandas.core.series.Series
s2 = pd.Series(["one", "two", "three"])
pd.DataFrame(data=dict(num=s1, word=s2))  # build a DataFrame from Series objects
Out[28]:
| | num | word |
---|---|---|
0 | 1 | one |
1 | 2 | two |
2 | 3 | three |
1. Loading Data from a File
In [54]:
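# Read a CSV file into a DataFrame (delimiter and header given explicitly)
# delimiter default: ',' | header default: 'infer' (acts like 0 when names is not given)
# The row given as header becomes the column names, and the rows after it become data.
# If header is None, the column names are generated as 0, 1, 2, 3, ...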
df = pd.read_csv("data/friend_list.csv", delimiter = ',', header = None)
df
Out[54]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | name | age | job |
1 | John | 20 | student |
2 | Jenny | 30 | developer |
3 | Nate | 30 | teacher |
4 | Julia | 40 | dentist |
5 | Brian | 45 | manager |
6 | Chris | 25 | intern |
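The effect of `header` can be checked without a file on disk. The sketch below is a minimal illustration that reads the same text from an in-memory buffer; the CSV text and the variable names are made up for this example and are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file that has a header row.
csv_text = "name,age,job\nJohn,20,student\nJenny,30,developer\n"

# header=0: the first line becomes the column names.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: every line is treated as data and columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)
print(no_header.shape)
```

With `header=None` the original header line ends up as the first data row, which is why it appears as row 0 in the output above.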
In [69]:
# Rename the DataFrame's column headers
df.columns = ['이름', '나이', '직업']  # Korean for name, age, job
df
Out[69]:
| | 이름 | 나이 | 직업 |
---|---|---|---|
0 | Jenny | 30 | developer |
1 | Nate | 30 | teacher |
2 | Julia | 40 | dentist |
3 | Brian | 45 | manager |
4 | Chris | 25 | intern |
In [4]:
# When loading a file, the column names can be given explicitly through names.
df = pd.read_csv("data/friend_list_no_head.csv", header = None, names = ['name', 'age', 'job'])
df
Out[4]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Jenny | 30 | developer |
2 | Nate | 30 | teacher |
3 | Julia | 40 | dentist |
4 | Brian | 45 | manager |
5 | Chris | 25 | intern |
2. Creating a DataFrame
In [54]:
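# With Python 3.7+ and a recent pandas, dicts keep insertion order, so the columns follow the key order.
# On older versions dicts were unordered and the column order was not guaranteed.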
friend_dict_list = [
{"name": "John", "age": 25, "job": "student"},
{"name": "Nate", "age": 30, "job": "teacher"}
]
In [65]:
df = pd.DataFrame(friend_dict_list)
df.head()
Out[65]:
| | name | age | job |
---|---|---|---|
0 | John | 25 | student |
1 | Nate | 30 | teacher |
In [60]:
# Change the column order of the DataFrame
df = df[["name", "job", "age"]]
df.head()
Out[60]:
| | name | job | age |
---|---|---|---|
0 | John | student | 25 |
1 | Nate | teacher | 30 |
In [79]:
from collections import OrderedDict  # makes the column order explicit when creating a DataFrame
In [71]:
# Build an OrderedDict from a list of (column name, [column values]) tuples.
friend_ordered_dict = OrderedDict(
[
("name", ["John", "Nate"]),
("age", [25, 30]),
("job", ["student", "teacher"])
]
)
df = pd.DataFrame.from_dict(friend_ordered_dict)
df.head()
Out[71]:
| | name | age | job |
---|---|---|---|
0 | John | 25 | student |
1 | Nate | 30 | teacher |
In [75]:
# Create a DataFrame from a list (data and header defined separately)
friend_list = [
["John", 20, "student"],
["Nate", 30, "teacher"]
]
column_name = ["name", "age", "job"]
df = pd.DataFrame.from_records(friend_list, columns = column_name)
df.head()
Out[75]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Nate | 30 | teacher |
In [82]:
# Create a DataFrame from a list (data and header defined together), indirectly via OrderedDict
friend_list = [
["name", ["John", "Nate"]],
["age", [20, 30]],
["job", ["student", "teacher"]]
]
df = pd.DataFrame.from_dict(OrderedDict(friend_list))
df.head()
Out[82]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Nate | 30 | teacher |
3. Saving a DataFrame to a File
In [88]:
friends = [
{"name": "John", "age": 25, "job": "student"},
{"name": "Nate", "age": 30, "job": "teacher"},
{"name": "Jenny", "age": 30, "job": None}
]
df = pd.DataFrame(friends)
df = df[["name", "age", "job"]]
df.head()
Out[88]:
| | name | age | job |
---|---|---|---|
0 | John | 25 | student |
1 | Nate | 30 | teacher |
2 | Jenny | 30 | None |
In [90]:
df.to_csv('friends.csv', index = True, header = True, na_rep = '-')  # index and header default to True; na_rep is the string written in place of missing values (None/NaN)
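To see what `na_rep` actually writes, the CSV can be sent to an in-memory buffer instead of a file. This is a minimal sketch with a made-up two-row frame; `df_demo`, `buffer`, and `restored` are names chosen for this example only.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"name": ["John", "Jenny"], "job": ["student", None]})

# Write to memory instead of disk so the text can be inspected directly.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False, na_rep="-")
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: na_values tells read_csv to treat "-" as missing again.
restored = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
print(restored["job"].isna().tolist())
```

Passing the same marker to `na_values` on the way back in is one way to keep the round trip lossless.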
4. Selecting and Filtering Rows and Columns
In [125]:
friend_list = OrderedDict(
[
("name", ["John", "Nate", "Jenny"]),
("age", [25, 30, 30]),
("job", ["student", "teacher", "developer"])
]
)
df = pd.DataFrame.from_dict(friend_list)
df.head()
Out[125]:
| | name | age | job |
---|---|---|---|
0 | John | 25 | student |
1 | Nate | 30 | teacher |
2 | Jenny | 30 | developer |
In [95]:
df[1:3]  # row slicing by position: returns rows 1 and 2
Out[95]:
| | name | age | job |
---|---|---|---|
1 | Nate | 30 | teacher |
2 | Jenny | 30 | developer |
In [97]:
df.loc[ [0, 2] ]  # select only specific, non-contiguous rows
Out[97]:
| | name | age | job |
---|---|---|---|
0 | John | 25 | student |
2 | Jenny | 30 | developer |
Using column conditions
In [106]:
df[df.age > 25]
Out[106]:
| | name | age | job |
---|---|---|---|
1 | Nate | 30 | teacher |
2 | Jenny | 30 | developer |
In [102]:
df.query('age > 25')
Out[102]:
| | name | age | job |
---|---|---|---|
1 | Nate | 30 | teacher |
2 | Jenny | 30 | developer |
In [126]:
df[(df.age > 25) & (df.name == "Nate")]
Out[126]:
| | name | age | job |
---|---|---|---|
1 | Nate | 30 | teacher |
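The `&` example above generalizes to other boolean combinations. The following sketch, on a small frame rebuilt here so that it runs on its own, shows `|` (or), `~` (not), and `isin`; the variable names are illustrative only.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Nate", "Jenny"],
    "age": [25, 30, 30],
    "job": ["student", "teacher", "developer"],
})

# OR: rows where age is under 30 or the job is "developer".
either = df_demo[(df_demo["age"] < 30) | (df_demo["job"] == "developer")]

# NOT: rows whose job is not "student".
not_student = df_demo[~(df_demo["job"] == "student")]

# isin: rows whose name appears in a list of candidates.
picked = df_demo[df_demo["name"].isin(["John", "Jenny"])]

print(either["name"].tolist())
print(not_student["name"].tolist())
print(picked["name"].tolist())
```

Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.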
Filter columns by index
In [119]:
friend_list = [
["John", 20, "student"],
["Nate", 30, "teacher"],
["Jenny", 30, "developer"]
]
df = pd.DataFrame.from_records(friend_list)
df
Out[119]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | John | 20 | student |
1 | Nate | 30 | teacher |
2 | Jenny | 30 | developer |
In [121]:
df.iloc[:, 0:2] # [row, column]
Out[121]:
| | 0 | 1 |
---|---|---|
0 | John | 20 |
1 | Nate | 30 |
2 | Jenny | 30 |
In [123]:
df.iloc[0:2, 0:2]
Out[123]:
| | 0 | 1 |
---|---|---|
0 | John | 20 |
1 | Nate | 30 |
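One detail worth keeping in mind is that `iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A minimal sketch with a made-up frame (the names `by_position` and `by_label` are only for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame(
    [["John", 20, "student"], ["Nate", 30, "teacher"], ["Jenny", 30, "developer"]],
    columns=["name", "age", "job"],
)

# iloc: positions, end excluded -> rows 0-1 and columns 0-1.
by_position = df_demo.iloc[0:2, 0:2]

# loc: labels, end included -> rows 0-2 and columns "name" through "age".
by_label = df_demo.loc[0:2, "name":"age"]

print(by_position.shape)
print(by_label.shape)
```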
Filter columns by name
In [128]:
df = pd.read_csv('data/friend_list_no_head.csv', header = None, names=["name", "age", "job"])
df
Out[128]:
| | name | age | job |
---|---|---|---|
0 | John | 20 | student |
1 | Jenny | 30 | developer |
2 | Nate | 30 | teacher |
3 | Julia | 40 | dentist |
4 | Brian | 45 | manager |
5 | Chris | 25 | intern |
In [132]:
df[["name", "age"]]  # return only the name and age columns
Out[132]:
| | name | age |
---|---|---|
0 | John | 20 |
1 | Jenny | 30 |
2 | Nate | 30 |
3 | Julia | 40 |
4 | Brian | 45 |
5 | Chris | 25 |
In [134]:
df.filter(items = ["age", "job"])  # return only the age and job columns
Out[134]:
| | age | job |
---|---|---|
0 | 20 | student |
1 | 30 | developer |
2 | 30 | teacher |
3 | 40 | dentist |
4 | 45 | manager |
5 | 25 | intern |
In [142]:
# Return the columns whose name contains 'a'.
# axis = 0: rows
# axis = 1: columns
df.filter(like = "a", axis = 1)
Out[142]:
| | name | age |
---|---|---|
0 | John | 20 |
1 | Jenny | 30 |
2 | Nate | 30 |
3 | Julia | 40 |
4 | Brian | 45 |
5 | Chris | 25 |
In [146]:
# Using a regex
df.filter(regex="b$", axis=1)  # return the columns whose name ends with "b"
Out[146]:
| | job |
---|---|
0 | student |
1 | developer |
2 | teacher |
3 | dentist |
4 | manager |
5 | intern |
5. Deleting Rows and Columns
In [161]:
friends = [
{"age": 15, "job": "student"},
{"age": 25, "job": "developer"},
{"age": 30, "job": "teacher"}
]
df = pd.DataFrame(friends, index = ["John", "Jenny", "Nate"], columns = ["age", "job"])
df
Out[161]:
| | age | job |
---|---|---|
John | 15 | student |
Jenny | 25 | developer |
Nate | 30 | teacher |
In [153]:
df.drop(["John", "Nate"])  # returns a new DataFrame with the rows removed
Out[153]:
| | age | job |
---|---|---|
Jenny | 25 | developer |
In [160]:
df.drop(["John", "Nate"], inplace = True)  # with inplace=True the original DataFrame is modified
df
Out[160]:
| | age | job |
---|---|---|
Jenny | 25 | developer |
In [163]:
friends = [
{"name": "John", "age": 15, "job": "student"},
{"name": "Jenny", "age": 25, "job": "developer"},
{"name": "Nate", "age": 30, "job": "teacher"}
]
df = pd.DataFrame(friends, columns = ["name", "age", "job"])
df
Out[163]:
| | name | age | job |
---|---|---|---|
0 | John | 15 | student |
1 | Jenny | 25 | developer |
2 | Nate | 30 | teacher |
In [164]:
df.drop(df.index[[0, 2]])  # delete rows by position through the index
Out[164]:
| | name | age | job |
---|---|---|---|
1 | Jenny | 25 | developer |
In [166]:
df.drop(["age"], axis = 1)  # returns a new DataFrame with the column removed
Out[166]:
| | name | job |
---|---|---|
0 | John | student |
1 | Jenny | developer |
2 | Nate | teacher |
6. Creating and Modifying Rows and Columns
In [167]:
friends = [
{"name": "John", "age": 15, "job": "student"},
{"name": "Jenny", "age": 25, "job": "developer"},
{"name": "Nate", "age": 30, "job": "teacher"}
]
df = pd.DataFrame(friends, columns = ["name", "age", "job"])
df
Out[167]:
| | name | age | job |
---|---|---|---|
0 | John | 15 | student |
1 | Jenny | 25 | developer |
2 | Nate | 30 | teacher |
In [169]:
df['salary'] = 0  # add a 'salary' column
df
Out[169]:
| | name | age | job | salary |
---|---|---|---|---|
0 | John | 15 | student | 0 |
1 | Jenny | 25 | developer | 0 |
2 | Nate | 30 | teacher | 0 |
In [172]:
import numpy as np
df['salary'] = np.where(df['job'] != "student", "yes", "no")  # salary is "no" when job is "student", otherwise "yes"
df
Out[172]:
| | name | age | job | salary |
---|---|---|---|---|
0 | John | 15 | student | no |
1 | Jenny | 25 | developer | yes |
2 | Nate | 30 | teacher | yes |
In [174]:
friends = [
{"name": "John", "midterm": 95, "final": 85},
{"name": "Jenny", "midterm": 85, "final": 80},
{"name": "Nate", "midterm": 30, "final": 10}
]
df = pd.DataFrame(friends, columns = ["name", "midterm", "final"])
df
Out[174]:
| | name | midterm | final |
---|---|---|---|
0 | John | 95 | 85 |
1 | Jenny | 85 | 80 |
2 | Nate | 30 | 10 |
In [177]:
df['total'] = df['midterm'] + df['final']
df
Out[177]:
| | name | midterm | final | total |
---|---|---|---|---|
0 | John | 95 | 85 | 180 |
1 | Jenny | 85 | 80 | 165 |
2 | Nate | 30 | 10 | 40 |
In [179]:
df['average'] = df['total'] / 2
df
Out[179]:
| | name | midterm | final | total | average |
---|---|---|---|---|---|
0 | John | 95 | 85 | 180 | 90.0 |
1 | Jenny | 85 | 80 | 165 | 82.5 |
2 | Nate | 30 | 10 | 40 | 20.0 |
In [193]:
grades = []
for row in df['average']:
    if row >= 90:
        grades.append('A')
    elif row >= 80:
        grades.append('B')
    else:
        grades.append('F')
df['grade'] = grades  # a list can be assigned as a new column
df
Out[193]:
| | name | midterm | final | total | average | grade |
---|---|---|---|---|---|---|
0 | John | 95 | 85 | 180 | 90.0 | A |
1 | Jenny | 85 | 80 | 165 | 82.5 | B |
2 | Nate | 30 | 10 | 40 | 20.0 | F |
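The loop above can also be written without iterating in Python. The sketch below is one possible alternative using `np.select`, which is not used in the original notebook; the `scores` frame is rebuilt here with the same averages so the block runs on its own.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"average": [90.0, 82.5, 20.0]})

# Conditions are checked in order; the first one that is True picks the grade.
conditions = [scores["average"] >= 90, scores["average"] >= 80]
choices = ["A", "B"]

# Rows that match no condition receive the default grade.
scores["grade"] = np.select(conditions, choices, default="F")
print(scores)
```

Because the conditions are evaluated in order, the thresholds should be listed from highest to lowest, just like the `if`/`elif` chain.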
In [194]:
def pass_or_fail(row):
    if row != "F":
        return "Pass"
    else:
        return "Fail"
df.grade = df.grade.apply(pass_or_fail)  # apply passes each element to pass_or_fail and builds the column from the return values
df
Out[194]:
| | name | midterm | final | total | average | grade |
---|---|---|---|---|---|---|
0 | John | 95 | 85 | 180 | 90.0 | Pass |
1 | Jenny | 85 | 80 | 165 | 82.5 | Pass |
2 | Nate | 30 | 10 | 40 | 20.0 | Fail |
In [197]:
date_list = [
{"yyyy-mm-dd": "2000-06-27"},
{"yyyy-mm-dd": "2007-10-27"}
]
df = pd.DataFrame(date_list, columns = ["yyyy-mm-dd"])
df
Out[197]:
| | yyyy-mm-dd |
---|---|
0 | 2000-06-27 |
1 | 2007-10-27 |
In [198]:
# Extract only the year and add it as a new column
def extract_year(row):
    return row.split('-')[0]
df['year'] = df['yyyy-mm-dd'].apply(extract_year)
df
Out[198]:
| | yyyy-mm-dd | year |
---|---|---|
0 | 2000-06-27 | 2000 |
1 | 2007-10-27 | 2007 |
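Splitting the string yields the year as text. If a numeric year is needed, one alternative (not part of the original notebook) is to parse the column with `pd.to_datetime` and read the parts through the `.dt` accessor. A minimal sketch with the same two dates:

```python
import pandas as pd

dates = pd.DataFrame({"yyyy-mm-dd": ["2000-06-27", "2007-10-27"]})

# Parse the strings into datetime values using the known format.
parsed = pd.to_datetime(dates["yyyy-mm-dd"], format="%Y-%m-%d")

# The .dt accessor exposes the date parts as integers rather than strings.
dates["year"] = parsed.dt.year
dates["month"] = parsed.dt.month
print(dates)
```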
In [206]:
friends = [
{"name": "John", "midterm": 95, "final": 85},
{"name": "Jenny", "midterm": 85, "final": 80},
{"name": "Nate", "midterm": 30, "final": 10}
]
df = pd.DataFrame(friends, columns = ["name", "midterm", "final"])
df
Out[206]:
| | name | midterm | final |
---|---|---|---|
0 | John | 95 | 85 |
1 | Jenny | 85 | 80 |
2 | Nate | 30 | 10 |
In [201]:
df2 = pd.DataFrame([["Ben", 50, 50]], columns = ["name", "midterm", "final"])
df2
Out[201]:
| | name | midterm | final |
---|---|---|---|
0 | Ben | 50 | 50 |
In [208]:
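# Add rows from another DataFrame (df and df2 have the same columns).
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
pd.concat([df, df2], ignore_index = True)
Out[208]:
| | name | midterm | final |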
---|---|---|---|
0 | John | 95 | 85 |
1 | Jenny | 85 | 80 |
2 | Nate | 30 | 10 |
3 | Ben | 50 | 50 |
In [210]:
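# Add a new row from a list (wrapped in a one-row DataFrame, since DataFrame.append was removed in pandas 2.0)
pd.concat([df, pd.DataFrame([["Ben", 50, 50]], columns = df.columns)], ignore_index = True)
Out[210]:
| | name | midterm | final |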
---|---|---|---|
0 | John | 95 | 85 |
1 | Jenny | 85 | 80 |
2 | Nate | 30 | 10 |
3 | Ben | 50 | 50 |
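When only a single row needs to be added and the frame has the default 0..n-1 index, assigning to `loc` at the next index label is another option. This is a sketch under that assumption, using a made-up frame; unlike the calls above it modifies the frame in place.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Jenny"],
    "midterm": [95, 85],
    "final": [85, 80],
})

# len(df_demo) is the next free label only because the index is 0, 1, ..., n-1.
df_demo.loc[len(df_demo)] = ["Ben", 50, 50]
print(df_demo)
```

For many rows, collecting them in a list and building or concatenating one DataFrame at the end is usually cheaper than growing the frame row by row.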
7. Grouping Data
In [213]:
student_list = [
{"name": "John", "major": "Computer Science", "sex": "male"},
{"name": "Nate", "major": "Computer Science", "sex": "male"},
{"name": "Abraham", "major": "Physics", "sex": "male"},
{"name": "Brian", "major": "Psychology", "sex": "male"},
{"name": "Janny", "major": "Economics", "sex": "female"},
{"name": "Yuna", "major": "Economics", "sex": "female"},
{"name": "Jeniffer", "major": "Computer Science", "sex": "female"},
{"name": "Edward", "major": "Computer Science", "sex": "male"},
{"name": "Zara", "major": "Psycholog", "sex": "female"},
{"name": "Wendy", "major": "Economics", "sex": "female"},
{"name": "Sara", "major": "Psychology", "sex": "female"}
]
df = pd.DataFrame(student_list, columns=["name", "major", "sex"])
df
Out[213]:
| | name | major | sex |
---|---|---|---|
0 | John | Computer Science | male |
1 | Nate | Computer Science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
4 | Janny | Economics | female |
5 | Yuna | Economics | female |
6 | Jeniffer | Computer Science | female |
7 | Edward | Computer Science | male |
8 | Zara | Psycholog | female |
9 | Wendy | Economics | female |
10 | Sara | Psychology | female |
In [215]:
groupby_major = df.groupby('major')
groupby_major.groups
Out[215]:
{'Computer Science': [0, 1, 6, 7], 'Economics': [4, 5, 9], 'Physics': [2], 'Psycholog': [8], 'Psychology': [3, 10]}
In [217]:
for name, group in groupby_major:
    print(name + " : " + str(len(group)))
    print(group)
    print()
Computer Science : 4
       name             major     sex
0      John  Computer Science    male
1      Nate  Computer Science    male
6  Jeniffer  Computer Science  female
7    Edward  Computer Science    male

Economics : 3
    name      major     sex
4  Janny  Economics  female
5   Yuna  Economics  female
9  Wendy  Economics  female

Physics : 1
      name    major   sex
2  Abraham  Physics  male

Psycholog : 1
   name      major     sex
8  Zara  Psycholog  female

Psychology : 2
     name       major     sex
3   Brian  Psychology    male
10   Sara  Psychology  female
In [220]:
df_major_cnt = pd.DataFrame({"count": groupby_major.size()}).reset_index()
df_major_cnt
Out[220]:
| | major | count |
---|---|---|
0 | Computer Science | 4 |
1 | Economics | 3 |
2 | Physics | 1 |
3 | Psycholog | 1 |
4 | Psychology | 2 |
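The count table above can also be produced more directly. The sketch below shows two variants on a small made-up frame: `value_counts` for a single column, and `groupby(...).size()` over two keys with `reset_index(name=...)` to name the count column. The `students` data is illustrative, not the notebook's full list.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["John", "Nate", "Jeniffer", "Yuna"],
    "major": ["Computer Science", "Computer Science", "Computer Science", "Economics"],
    "sex": ["male", "male", "female", "female"],
})

# Count rows per major in one call.
major_counts = students["major"].value_counts()

# Count rows per (major, sex) pair and turn the result into a flat table.
by_major_and_sex = students.groupby(["major", "sex"]).size().reset_index(name="count")

print(major_counts)
print(by_major_and_sex)
```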
In [221]:
groupby_sex = df.groupby('sex')
groupby_sex.groups
Out[221]:
{'female': [4, 5, 6, 8, 9, 10], 'male': [0, 1, 2, 3, 7]}
In [223]:
for name, group in groupby_sex:
    print(name + " : " + str(len(group)))
    print(group)
    print()
female : 6
        name             major     sex
4      Janny         Economics  female
5       Yuna         Economics  female
6   Jeniffer  Computer Science  female
8       Zara         Psycholog  female
9      Wendy         Economics  female
10      Sara        Psychology  female

male : 5
      name             major   sex
0     John  Computer Science  male
1     Nate  Computer Science  male
2  Abraham           Physics  male
3    Brian        Psychology  male
7   Edward  Computer Science  male
8. Removing Duplicate Data
In [225]:
student_list = [
{"name": "John", "major": "Computer Science", "sex": "male"},
{"name": "Nate", "major": "Computer Science", "sex": "male"},
{"name": "Abraham", "major": "Physics", "sex": "male"},
{"name": "Brian", "major": "Psychology", "sex": "male"},
{"name": "John", "major": "Computer Science", "sex": "male"},
]
df = pd.DataFrame(student_list, columns=["name", "major", "sex"])
df
Out[225]:
| | name | major | sex |
---|---|---|---|
0 | John | Computer Science | male |
1 | Nate | Computer Science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
4 | John | Computer Science | male |
In [227]:
# Row 4 of the df above has exactly the same values as row 0.
df.duplicated()
Out[227]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
In [229]:
# Remove duplicated rows
df.drop_duplicates()
Out[229]:
| | name | major | sex |
---|---|---|---|
0 | John | Computer Science | male |
1 | Nate | Computer Science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
In [230]:
student_list = [
{"name": "John", "major": "Computer Science", "sex": "male"},
{"name": "Nate", "major": "Computer Science", "sex": "male"},
{"name": "Abraham", "major": "Physics", "sex": "male"},
{"name": "Brian", "major": "Psychology", "sex": "male"},
{"name": "John", "major": "Economics", "sex": "male"},
{"name": "Nate", "major": "Physics", "sex": "male"},
]
df = pd.DataFrame(student_list, columns=["name", "major", "sex"])
df
Out[230]:
| | name | major | sex |
---|---|---|---|
0 | John | Computer Science | male |
1 | Nate | Computer Science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
4 | John | Economics | male |
5 | Nate | Physics | male |
In [232]:
# Check for duplicates on the 'name' column only
df.duplicated(["name"])
Out[232]:
0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool
In [236]:
# Remove rows that are duplicated on the 'name' column only
# keep='first' keeps only the first row of each set of duplicates, keep='last' keeps only the last one, and keep=False drops all of them.
df.drop_duplicates(["name"], keep = 'first')
Out[236]:
| | name | major | sex |
---|---|---|---|
0 | John | Computer Science | male |
1 | Nate | Computer Science | male |
2 | Abraham | Physics | male |
3 | Brian | Psychology | male |
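The three `keep` settings are easiest to compare side by side. A minimal sketch on a made-up four-row frame in which "John" appears twice:

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["John", "Nate", "Abraham", "John"],
    "major": ["Computer Science", "Physics", "Physics", "Economics"],
})

first = students.drop_duplicates(["name"], keep="first")  # keeps row 0, drops row 3
last = students.drop_duplicates(["name"], keep="last")    # keeps row 3, drops row 0
none = students.drop_duplicates(["name"], keep=False)     # drops both John rows

print(first.index.tolist())
print(last.index.tolist())
print(none.index.tolist())
```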
9. Finding NaN and Replacing It with Another Value
In [258]:
school_id_list = [
{"name": "John", "job": "teacher", "age": 40},
{"name": "Nate", "job": "teacher", "age": 35},
{"name": "Yuna", "job": "teacher", "age": 37},
{"name": "Abraham", "job": "student", "age": 10},
{"name": "Brian", "job": "student", "age": 12},
{"name": "Janny", "job": "student", "age": 11},
{"name": "Nate", "job": "teacher", "age": None},
{"name": "John", "job": "student", "age": None}
]
df = pd.DataFrame(school_id_list, columns=["name", "job", "age"])
df
Out[258]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | NaN |
7 | John | student | NaN |
In [241]:
df.shape
Out[241]:
(8, 3)
In [243]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    8 non-null      object
 1   job     8 non-null      object
 2   age     6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes
In [249]:
df.isna()
Out[249]:
| | name | job | age |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | False | False |
3 | False | False | False |
4 | False | False | False |
5 | False | False | False |
6 | False | False | True |
7 | False | False | True |
In [247]:
df.isnull()
Out[247]:
| | name | job | age |
---|---|---|---|
0 | False | False | False |
1 | False | False | False |
2 | False | False | False |
3 | False | False | False |
4 | False | False | False |
5 | False | False | False |
6 | False | False | True |
7 | False | False | True |
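`isna()` and `isnull()` return the same boolean frame, so a common next step is to summarize it. The sketch below counts missing values per column and, as one further option not shown in the original notebook, drops the incomplete rows with `dropna`. The frame is a small made-up one rather than `school_id_list`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Nate", "Yuna"],
    "age": [40, None, 37],
})

# True counts as 1, so summing the boolean frame gives missing values per column.
missing_per_column = df_demo.isna().sum()

# Keep only the rows where age is present.
complete_rows = df_demo.dropna(subset=["age"])

print(missing_per_column)
print(complete_rows)
```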
In [259]:
df.age = df.age.fillna(0)
df
Out[259]:
| | name | job | age |
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | 0.0 |
7 | John | student | 0.0 |
In [266]:
df = pd.DataFrame(school_id_list, columns=["name", "job", "age"])
In [262]:
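# Fill the NaN values in the age column with the median age of each job group (teacher, student).
df['age'] = df['age'].fillna(df.groupby('job')['age'].transform('median'))  # assign the result back instead of using inplace=True on a column selection
df
Out[262]:
| | name | job | age |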
---|---|---|---|
0 | John | teacher | 40.0 |
1 | Nate | teacher | 35.0 |
2 | Yuna | teacher | 37.0 |
3 | Abraham | student | 10.0 |
4 | Brian | student | 12.0 |
5 | Janny | student | 11.0 |
6 | Nate | teacher | 37.0 |
7 | John | student | 11.0 |
10. Using the apply Function
In [268]:
date_list = [
{"yyyy-mm-dd": "2000-06-27"},
{"yyyy-mm-dd": "2005-09-24"},
{"yyyy-mm-dd": "2007-12-20"}
]
df = pd.DataFrame(date_list, columns = ["yyyy-mm-dd"])
df
Out[268]:
| | yyyy-mm-dd |
---|---|
0 | 2000-06-27 |
1 | 2005-09-24 |
2 | 2007-12-20 |
In [270]:
def extract_year(column):
    return column.split("-")[0]
df['year'] = df['yyyy-mm-dd'].apply(extract_year)
df
Out[270]:
| | yyyy-mm-dd | year |
---|---|---|
0 | 2000-06-27 | 2000 |
1 | 2005-09-24 | 2005 |
2 | 2007-12-20 | 2007 |
In [276]:
def get_age(year, current_year):
    return current_year - int(year)
df['age'] = df['year'].apply(get_age, current_year = 2018)  # year is not passed explicitly: each element of the year column is supplied as the first argument
df
Out[276]:
| | yyyy-mm-dd | year | age |
---|---|---|---|
0 | 2000-06-27 | 2000 | 18 |
1 | 2005-09-24 | 2005 | 13 |
2 | 2007-12-20 | 2007 | 11 |
In [281]:
def get_introduce(age, prefix, suffix):
    return prefix + str(age) + suffix
df['introduce'] = df['age'].apply(get_introduce, prefix = "I am ", suffix = " years old.")
df
Out[281]:
| | yyyy-mm-dd | year | age | introduce |
---|---|---|---|---|
0 | 2000-06-27 | 2000 | 18 | I am 18 years old. |
1 | 2005-09-24 | 2005 | 13 | I am 13 years old. |
2 | 2007-12-20 | 2007 | 11 | I am 11 years old. |
In [283]:
# Using apply across multiple columns (row by row, axis = 1)
def get_introduce_2(row):
    return "I was born in " + str(row.year) + " my age is " + str(row.age)
df.introduce = df.apply(get_introduce_2, axis = 1)
df
Out[283]:
| | yyyy-mm-dd | year | age | introduce |
---|---|---|---|---|
0 | 2000-06-27 | 2000 | 18 | I was born in 2000 my age is 18 |
1 | 2005-09-24 | 2005 | 13 | I was born in 2005 my age is 13 |
2 | 2007-12-20 | 2007 | 11 | I was born in 2007 my age is 11 |
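Row-wise `apply` calls a Python function once per row. For simple string building, the same column can be produced with whole-column operations, which is usually faster on large frames. A sketch of that alternative, assuming `year` holds strings and `age` holds integers as in the cells above (the `people` frame is rebuilt here for illustration):

```python
import pandas as pd

people = pd.DataFrame({"year": ["2000", "2005"], "age": [18, 13]})

# String concatenation works column-wise; age is converted to text first.
people["introduce"] = (
    "I was born in " + people["year"] + " my age is " + people["age"].astype(str)
)
print(people["introduce"].tolist())
```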
11. Using the map and applymap Functions
In [288]:
friends = [
{"age": 15, "job": "student"},
{"age": 25, "job": "developer"},
{"age": 30, "job": "teacher"}
]
df = pd.DataFrame(friends, columns = ["age", "job"])
df
Out[288]:
| | age | job |
---|---|---|
0 | 15 | student |
1 | 25 | developer |
2 | 30 | teacher |
In [290]:
# map: applies to a single column (Series), element by element; here a dict maps each value to its replacement
df.job = df.job.map({"student": 1, "developer": 2, "teacher": 3})
df
Out[290]:
| | age | job |
---|---|---|
0 | 15 | 1 |
1 | 25 | 2 |
2 | 30 | 3 |
In [291]:
x_y_z = [
{"x": 5.5, "y": -5.6, "z": -1.1},
{"x": -5.2, "y": 5.5, "z": -2.2},
{"x": -1.6, "y": -4.5, "z": -3.3}
]
df = pd.DataFrame(x_y_z)
df
Out[291]:
| | x | y | z |
---|---|---|---|
0 | 5.5 | -5.6 | -1.1 |
1 | -5.2 | 5.5 | -2.2 |
2 | -1.6 | -4.5 | -3.3 |
In [296]:
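# applymap: applies a function to every element of the whole DataFrame
# (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
df = df.applymap(np.around)  # np.around is a function
df
Out[296]:
| | x | y | z |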
---|---|---|---|
0 | 6.0 | -6.0 | -1.0 |
1 | -5.0 | 6.0 | -2.0 |
2 | -2.0 | -4.0 | -3.0 |
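Two variations may be useful here, both offered as sketches rather than as the notebook's method. For plain rounding, `DataFrame.round()` works on the whole frame without an element-wise call. For other functions, newer pandas versions (2.1 and later) provide `DataFrame.map`, so the code below picks whichever of `map` or `applymap` is available. The `points` frame and the helper name `elementwise` are made up for this example.

```python
import pandas as pd

points = pd.DataFrame({"x": [5.5, -5.2], "y": [-5.6, 5.5]})

# Whole-frame rounding; halves are rounded to the nearest even number.
rounded = points.round()

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = points.map if hasattr(points, "map") else points.applymap
absolute = elementwise(abs)

print(rounded)
print(absolute)
```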
12. Extracting the Unique Values in a Column and Counting Them
In [298]:
job_list = [{'name': 'John', 'job': "teacher"},
{'name': 'Nate', 'job': "teacher"},
{'name': 'Fred', 'job': "teacher"},
{'name': 'Abraham', 'job': "student"},
{'name': 'Brian', 'job': "student"},
{'name': 'Janny', 'job': "developer"},
{'name': 'Nate', 'job': "teacher"},
{'name': 'Obrian', 'job': "dentist"},
{'name': 'Yuna', 'job': "teacher"},
{'name': 'Rob', 'job': "lawyer"},
{'name': 'Brian', 'job': "student"},
{'name': 'Matt', 'job': "student"},
{'name': 'Wendy', 'job': "banker"},
{'name': 'Edward', 'job': "teacher"},
{'name': 'Ian', 'job': "teacher"},
{'name': 'Chris', 'job': "banker"},
{'name': 'Philip', 'job': "lawyer"},
{'name': 'Janny', 'job': "basketball player"},
{'name': 'Gwen', 'job': "teacher"},
{'name': 'Jessy', 'job': "student"}
]
df = pd.DataFrame(job_list, columns = ['name', 'job'])
df
Out[298]:
| | name | job |
---|---|---|
0 | John | teacher |
1 | Nate | teacher |
2 | Fred | teacher |
3 | Abraham | student |
4 | Brian | student |
5 | Janny | developer |
6 | Nate | teacher |
7 | Obrian | dentist |
8 | Yuna | teacher |
9 | Rob | lawyer |
10 | Brian | student |
11 | Matt | student |
12 | Wendy | banker |
13 | Edward | teacher |
14 | Ian | teacher |
15 | Chris | banker |
16 | Philip | lawyer |
17 | Janny | basketball player |
18 | Gwen | teacher |
19 | Jessy | student |
In [300]:
df.job.unique()
Out[300]:
array(['teacher', 'student', 'developer', 'dentist', 'lawyer', 'banker', 'basketball player'], dtype=object)
In [302]:
df.job.value_counts()
Out[302]:
teacher              8
student              5
lawyer               2
banker               2
dentist              1
basketball player    1
developer            1
Name: job, dtype: int64
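Two closely related calls are often used together with `unique()` and `value_counts()`: `nunique()` for the number of distinct values, and `value_counts(normalize=True)` for shares instead of raw counts. A minimal sketch on a made-up Series:

```python
import pandas as pd

jobs = pd.Series(["teacher", "teacher", "student", "banker"], name="job")

# Number of distinct values in the column.
distinct_count = jobs.nunique()

# Relative frequency of each value; the shares sum to 1.
shares = jobs.value_counts(normalize=True)

print(distinct_count)
print(shares)
```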
13. Combining Two DataFrames
In [303]:
l1 = [{'name': 'John', 'job': "teacher"},
{'name': 'Nate', 'job': "student"},
{'name': 'Fred', 'job': "developer"}]
l2 = [{'name': 'Ed', 'job': "dentist"},
{'name': 'Jack', 'job': "farmer"},
{'name': 'Ted', 'job': "designer"}]
df1 = pd.DataFrame(l1, columns = ['name', 'job'])
df2 = pd.DataFrame(l2, columns = ['name', 'job'])
In [305]:
df1
Out[305]:
| | name | job |
---|---|---|
0 | John | teacher |
1 | Nate | student |
2 | Fred | developer |
In [306]:
df2
Out[306]:
| | name | job |
---|---|---|
0 | Ed | dentist |
1 | Jack | farmer |
2 | Ted | designer |
In [308]:
# Combine df1 and df2 by rows (method 1)
result = pd.concat([df1, df2], ignore_index = True)
result
Out[308]:
| | name | job |
---|---|---|
0 | John | teacher |
1 | Nate | student |
2 | Fred | developer |
3 | Ed | dentist |
4 | Jack | farmer |
5 | Ted | designer |
In [310]:
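# Combine df1 and df2 by rows (method 2): DataFrame.append works only before pandas 2.0, where it was removed; use method 1 on newer versions
result = df1.append(df2, ignore_index = True)
result
Out[310]:
| | name | job |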
---|---|---|
0 | John | teacher |
1 | Nate | student |
2 | Fred | developer |
3 | Ed | dentist |
4 | Jack | farmer |
5 | Ted | designer |
In [315]:
l3 = [{'name': 'John', 'job': "teacher"},
{'name': 'Nate', 'job': "student"},
{'name': 'Jack', 'job': "developer"}]
l4 = [{'age': 25, 'country': "U.S."},
{'age': 30, 'country': "U.K."},
{'age': 45, 'country': "Korea"}]
df3 = pd.DataFrame(l3, columns = ['name', 'job'])
df4 = pd.DataFrame(l4, columns = ['age', 'country'])
In [313]:
df3
Out[313]:
| | name | job |
---|---|---|
0 | John | teacher |
1 | Nate | student |
2 | Jack | developer |
In [314]:
df4
Out[314]:
| | age | country |
---|---|---|
0 | 25 | U.S. |
1 | 30 | U.K. |
2 | 45 | Korea |
In [320]:
# Combine df3 and df4 by columns
result = pd.concat([df3, df4], axis = 1, ignore_index = False)
result
Out[320]:
| | name | job | age | country |
---|---|---|---|---|
0 | John | teacher | 25 | U.S. |
1 | Nate | student | 30 | U.K. |
2 | Jack | developer | 45 | Korea |
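Column-wise `concat` lines rows up by index label, not by position. The frames above share the index 0..2, so the rows match one to one. The sketch below uses two made-up frames with only partly overlapping indexes to show what happens otherwise: labels missing on one side are filled with NaN.

```python
import pandas as pd

left = pd.DataFrame({"name": ["John", "Nate"]}, index=[0, 1])
right = pd.DataFrame({"age": [25, 30]}, index=[1, 2])

# The result contains every index label that appears in either frame.
aligned = pd.concat([left, right], axis=1)
print(aligned)
```

If positional pairing is intended, calling `reset_index(drop=True)` on both frames first is one way to make the labels agree.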
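Pandas Data Analysis Basics Hands-on Practice - Inflearn
Short lectures introducing Pandas commands that are useful in real-world data science work; all of the code used in the lectures can be viewed on GitHub and downloaded for practice. Beginner-level data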
www.inflearn.com