python - Extracting and parsing dates in a pandas dataframe -




i trying convert messy notebook dates sorted date series in pandas.

0           03/25/93 total time of visit (in minutes):\n 1                         6/18/85 primary care doctor:\n 2      sshe plans move of 7/8/71 in-home servic... 3                  7 on 9/27/75 audit c score current:\n 4      2/6/96 sleep studypain treatment pain level (n... 5                      .per 7/06/79 movement d/o note:\n 6      4, 5/18/78 patient's thoughts current su... 7      10/24/89 cpt code: 90801 - psychiatric diagnos... 8                           3/7/86 sos-10 total score:\n 9               (4/10/71)score-1audit c score current:\n 10     (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc... 11                         4/09/75 sos-10 total score:\n 12     8/01/98 communication referring physician... 13     1/26/72 communication referring physician... 14     5/24/1990 cpt code: 90792: medical servic... 15     1/25/2011 cpt code: 90792: medical servic... 

i have multiple dates formats such 04/20/2009; 04/20/09; 4/20/09; 4/3/09. , convert these mm/dd/yyyy new column.

so far have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})') 

also, not how extract lines mm/yy or yyyy format date without interfering code above. bear in mind absence of day or month consider 1st , january default values.

you can use pd.series.str.extract regex, , apply pd.to_datetime:

df['date'] = df.text.str.extract(r'(?p<date>\d+(?:\/\d+){2})', expand=false)\                                                                .apply(pd.to_datetime)  df                                                   text       date 0                                                                0        03/25/93 total time of visit (in minutes):\n 1993-03-25 1                      6/18/85 primary care doctor:\n 1985-06-18 2   sshe plans move of 7/8/71 in-home servic... 1971-07-08 3               7 on 9/27/75 audit c score current:\n 1975-09-27 4   2/6/96 sleep studypain treatment pain level (n... 1996-02-06 5                   .per 7/06/79 movement d/o note:\n 1979-07-06 6   4, 5/18/78 patient's thoughts current su... 1978-05-18 7   10/24/89 cpt code: 90801 - psychiatric diagnos... 1989-10-24 8                        3/7/86 sos-10 total score:\n 1986-03-07 9            (4/10/71)score-1audit c score current:\n 1971-04-10 10  (5/11/85) crt-1.96, bun-26; ast/alt-16/22; wbc... 1985-05-11 11                      4/09/75 sos-10 total score:\n 1975-04-09 12  8/01/98 communication referring physician... 1998-08-01 13  1/26/72 communication referring physician... 1972-01-26 14  5/24/1990 cpt code: 90792: medical servic... 1990-05-24 15  1/25/2011 cpt code: 90792: medical servic... 2011-01-25 

str.extract returns series of strings this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',        '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',        '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object) 

regex details

(?p<date>\d+(?:\/\d+){2}) 
  • (?p<date>....) - named capturing group
  • \d+ 1 or more digits
  • (?:\/\d+){2} - non-capturing group repeating twice,
    • \/ - escaped forward slash
    • {2} - repeater (two times)

regex missing days

to handle optional days, modified regex required:

(?p<date>(?:\d+\/)?\d+/\d+) 

details

  • (?p<date>....) - named capturing group
  • (?:\d+\/)? - nested group (non-capturing) \d+\/ optional.
  • \d+ 1 or more digits
  • \/ escaped forward slash

the rest same. substitute regex in place of current one. pd.to_datetime handle missing days.





wiki

Comments

Popular posts from this blog

Asterisk AGI Python Script to Dialplan does not work -

python - Read npy file directly from S3 StreamingBody -

kotlin - Out-projected type in generic interface prohibits the use of metod with generic parameter -