python - Extracting and parsing dates in a pandas dataframe -
i trying convert messy notebook dates sorted date series in pandas.
0 03/25/93 total time of visit (in minutes):\n 1 6/18/85 primary care doctor:\n 2 sshe plans move of 7/8/71 in-home servic... 3 7 on 9/27/75 audit c score current:\n 4 2/6/96 sleep studypain treatment pain level (n... 5 .per 7/06/79 movement d/o note:\n 6 4, 5/18/78 patient's thoughts current su... 7 10/24/89 cpt code: 90801 - psychiatric diagnos... 8 3/7/86 sos-10 total score:\n 9 (4/10/71)score-1audit c score current:\n 10 (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc... 11 4/09/75 sos-10 total score:\n 12 8/01/98 communication referring physician... 13 1/26/72 communication referring physician... 14 5/24/1990 cpt code: 90792: medical servic... 15 1/25/2011 cpt code: 90792: medical servic... i have multiple dates formats such 04/20/2009; 04/20/09; 4/20/09; 4/3/09. , convert these mm/dd/yyyy new column.
so far have done
df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})') also, not how extract lines mm/yy or yyyy format date without interfering code above. bear in mind absence of day or month consider 1st , january default values.
you can use pd.series.str.extract regex, , apply pd.to_datetime:
df['date'] = df.text.str.extract(r'(?p<date>\d+(?:\/\d+){2})', expand=false)\ .apply(pd.to_datetime) df text date 0 0 03/25/93 total time of visit (in minutes):\n 1993-03-25 1 6/18/85 primary care doctor:\n 1985-06-18 2 sshe plans move of 7/8/71 in-home servic... 1971-07-08 3 7 on 9/27/75 audit c score current:\n 1975-09-27 4 2/6/96 sleep studypain treatment pain level (n... 1996-02-06 5 .per 7/06/79 movement d/o note:\n 1979-07-06 6 4, 5/18/78 patient's thoughts current su... 1978-05-18 7 10/24/89 cpt code: 90801 - psychiatric diagnos... 1989-10-24 8 3/7/86 sos-10 total score:\n 1986-03-07 9 (4/10/71)score-1audit c score current:\n 1971-04-10 10 (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc... 1985-05-11 11 4/09/75 sos-10 total score:\n 1975-04-09 12 8/01/98 communication referring physician... 1998-08-01 13 1/26/72 communication referring physician... 1972-01-26 14 5/24/1990 cpt code: 90792: medical servic... 1990-05-24 15 1/25/2011 cpt code: 90792: medical servic... 2011-01-25 str.extract returns series of strings this:
array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79', '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75', '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object) regex details
(?p<date>\d+(?:\/\d+){2}) (?p<date>....)- named capturing group\d+1 or more digits(?:\/\d+){2}- non-capturing group repeating twice,\/- escaped forward slash{2}- repeater (two times)
regex missing days
to handle optional days, modified regex required:
(?p<date>(?:\d+\/)?\d+/\d+) details
(?p<date>....)- named capturing group(?:\d+\/)?- nested group (non-capturing)\d+\/optional.\d+1 or more digits\/escaped forward slash
the rest same. substitute regex in place of current one. pd.to_datetime handle missing days.
wiki
Comments
Post a Comment