python - Extracting and parsing dates in a pandas dataframe
I am trying to convert messy notebook dates into a sorted date series in pandas.
0     03/25/93 total time of visit (in minutes):\n
1     6/18/85 primary care doctor:\n
2     sshe plans move of 7/8/71 in-home servic...
3     7 on 9/27/75 audit c score current:\n
4     2/6/96 sleep studypain treatment pain level (n...
5     .per 7/06/79 movement d/o note:\n
6     4, 5/18/78 patient's thoughts current su...
7     10/24/89 cpt code: 90801 - psychiatric diagnos...
8     3/7/86 sos-10 total score:\n
9     (4/10/71)score-1audit c score current:\n
10    (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc...
11    4/09/75 sos-10 total score:\n
12    8/01/98 communication referring physician...
13    1/26/72 communication referring physician...
14    5/24/1990 cpt code: 90792: medical servic...
15    1/25/2011 cpt code: 90792: medical servic...
I have multiple date formats, such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. I want to convert these to mm/dd/yyyy in a new column.
So far I have done:
df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')
Also, I am not sure how to extract lines with dates in mm/yy or yyyy format without interfering with the code above. Bear in mind that when the day or month is absent, the 1st and January should be used as the default values.
You can use pd.Series.str.extract with a regex, and then apply pd.to_datetime:
df['date'] = df.text.str.extract(r'(?P<date>\d+(?:\/\d+){2})', expand=False)\
               .apply(pd.to_datetime)

df

    text                                               date
0   0 03/25/93 total time of visit (in minutes):\n     1993-03-25
1   6/18/85 primary care doctor:\n                     1985-06-18
2   sshe plans move of 7/8/71 in-home servic...        1971-07-08
3   7 on 9/27/75 audit c score current:\n              1975-09-27
4   2/6/96 sleep studypain treatment pain level (n...  1996-02-06
5   .per 7/06/79 movement d/o note:\n                  1979-07-06
6   4, 5/18/78 patient's thoughts current su...        1978-05-18
7   10/24/89 cpt code: 90801 - psychiatric diagnos...  1989-10-24
8   3/7/86 sos-10 total score:\n                       1986-03-07
9   (4/10/71)score-1audit c score current:\n           1971-04-10
10  (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc...  1985-05-11
11  4/09/75 sos-10 total score:\n                      1975-04-09
12  8/01/98 communication referring physician...       1998-08-01
13  1/26/72 communication referring physician...       1972-01-26
14  5/24/1990 cpt code: 90792: medical servic...       1990-05-24
15  1/25/2011 cpt code: 90792: medical servic...       2011-01-25
str.extract returns a Series of strings like this:
array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79', '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75', '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)
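Since the question also asks for a sorted series and an mm/dd/yyyy column, here is a minimal end-to-end sketch of one way to get there. It is not the code from the answer above: it swaps the per-element pd.to_datetime call for an explicit format, because how a two-digit year is resolved can depend on the parser and the pandas version. The four sample rows, the names four_digit and date_str, and the rule that every two-digit year means 19xx are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical four-row sample adapted from rows shown in the question.
df = pd.DataFrame({"text": [
    "5/24/1990 cpt code: 90792: medical servic...",
    "03/25/93 total time of visit (in minutes):",
    "(4/10/71)score-1audit c score current:",
    "sshe plans move of 7/8/71 in-home servic...",
]})

# Step 1: extract the first m/d/y-looking substring of each row as a string.
raw = df["text"].str.extract(r"(?P<date>\d+(?:\/\d+){2})", expand=False)

# Step 2 (assumption): expand two-digit years to 19xx so the result does not
# depend on how the installed parser guesses the century.
four_digit = raw.str.replace(r"/(\d{2})$", r"/19\1", regex=True)

# Step 3: parse with an explicit format, then sort chronologically.
df["date"] = pd.to_datetime(four_digit, format="%m/%d/%Y")
df = df.sort_values("date").reset_index(drop=True)

# Step 4: render the parsed dates as mm/dd/yyyy strings in a new column.
df["date_str"] = df["date"].dt.strftime("%m/%d/%Y")
print(df[["date", "date_str"]])
```

If your data contains two-digit years from the 2000s, the 19xx rule in step 2 would need to be replaced with a cutoff that fits your records.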
Regex details
(?P<date>\d+(?:\/\d+){2})
(?P<date>....) - named capturing group
\d+ - 1 or more digits
(?:\/\d+){2} - non-capturing group repeating twice, where
    \/ - escaped forward slash
    {2} - repeater (two times)
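To see the pieces of this pattern at work outside pandas, here is a small sketch that uses only the standard re module. The three sample strings are taken from the question's data; the variable names are just for illustration.

```python
import re

# Same pattern as above: one run of digits, then "/digits" exactly twice.
pattern = re.compile(r"(?P<date>\d+(?:\/\d+){2})")

samples = [
    "(4/10/71)score-1audit c score current:",
    "7 on 9/27/75 audit c score current:",
    "5/24/1990 cpt code: 90792",
]

# search() finds the first match anywhere in the string; the named group
# "date" holds the matched text.
found = [pattern.search(s).group("date") for s in samples]
print(found)  # ['4/10/71', '9/27/75', '5/24/1990']
```

Note that the leading "7" in the second string is skipped because it is not followed by a slash, so the pattern only matches once it reaches "9/27/75".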
Regex for missing days
To handle optional days, a slightly modified regex is required:
(?P<date>(?:\d+\/)?\d+/\d+)
Details
(?P<date>....) - named capturing group
(?:\d+\/)? - nested group (non-capturing), where \d+\/ is optional
    \d+ - 1 or more digits
    \/ - escaped forward slash
The rest is the same. Substitute this regex in place of the current one. pd.to_datetime will handle the missing days.
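If you would rather not rely on the parser's defaults, the missing day can also be filled in explicitly. The sketch below is one possible way to do that with the standard library only. The helper name parse_partial, its century parameter, and the rule that two-digit years mean 19xx are assumptions for illustration, not part of the answer above.

```python
import re
from datetime import date

# Optional-day pattern from above: "m/d/y" or just "m/y".
DATE_RE = re.compile(r"(?P<date>(?:\d+\/)?\d+/\d+)")

def parse_partial(text, century=1900):
    """Return a date for m/d/y or m/y text; a missing day becomes the 1st."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    parts = [int(p) for p in match.group("date").split("/")]
    if len(parts) == 2:
        month, year = parts
        day = 1  # default requested in the question
    else:
        month, day, year = parts
    if year < 100:
        year += century  # assumption: two-digit years are 19xx
    # date() raises ValueError if the match is not a real calendar date.
    return date(year, month, day)

print(parse_partial("3/7/86 sos-10 total score:"))
print(parse_partial("seen 6/2008 for follow-up"))
```

A helper like this can be applied to the text column with df.text.apply(parse_partial). Year-only text (yyyy) is not matched by this pattern, so it would need a separate rule.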