Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console

- July 25, 2014

I am trying to read an XML document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.

Here's my test XML, saved as a file in UTF-8 encoding:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <info name="愛よ">ÜÜÜÜÜÜÜ</info>
  <items>
    <item thing="ÖöÖö">"23Äßßß"</item>
  </items>
</root>

First I check the XML using ElementTree:

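import xml.etree.ElementTree as ET

def printXml(xml, indent=''):
    # Print the tag and its text, with newlines stripped from the text
    print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
    # Print each attribute one level deeper than the tag
    for k, v in xml.attrib.items():
        print(indent + '\t' + k + ' - ' + v)
    # Recurse into the children (iterating the element replaces getchildren(), which was removed in Python 3.9)
    for child in xml:
        printXml(child, indent + '\t')

xml0 = ET.parse("test.xml").getroot()
printXml(xml0)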

The output is correct:

root: 
        info: ÜÜÜÜÜÜÜ
                name - 愛よ
        items: 
                item: "23Äßßß"
                        thing - ÖöÖö

Now I read the same file with Beautiful Soup and pretty-print it:

import bs4

with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")
print(xml.prettify())

Output:

<!--?xml version="1.0" encoding="utf-8"?-->
<html>
 <head>
 </head>
 <body>
  <root>
   <info name="æ„›ã‚ˆ">
    ÃœÃœÃœÃœÃœÃœÃœ
   </info>
   <items>
    <item thing="Ã–Ã¶Ã–Ã¶">
     "23Ã„ÃŸÃŸÃŸ"
    </item>
   </items>
  </root>
 </body>
</html>

This is wrong. Making the call with an explicit encoding specified, bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8"), doesn't change the result.

Doing

print(xml.original_encoding)

outputs

None

So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header information says utf-8 as well, and I have chardet installed as the doc recommends.

Am I making a mistake here? What could be causing this?

Edit: When I invoke the code without html5lib, I get this warning:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 241 of the file c:\users\my.name\appdata\local\continuum\anaconda2\envs\python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html5lib")

  markup_type=markup_type))

Edit 2:

As suggested in a comment, I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains.

Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"), still with the same output.

What also strikes me as odd is that even when specifying an encoding, as in bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8'), the value of xml.original_encoding is None, contrary to what is written in the doc.
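A possible explanation for the None value, offered as a hedged sketch rather than a confirmed diagnosis: a file opened in text mode, as in open("test.xml") above, hands Beautiful Soup an already decoded str, so there is no byte encoding left to detect and from_encoding is ignored. The sample markup and the variable names below are made up for illustration, and html.parser is used only because it needs no extra install.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the test file in the question.
xml_text = '<?xml version="1.0" encoding="utf-8"?><root><info name="愛よ">ÜÜÜÜÜÜÜ</info></root>'
xml_bytes = xml_text.encode("utf-8")

with warnings.catch_warnings():
    # Silence bs4's warnings about the ignored from_encoding and about
    # parsing XML with an HTML parser; they are expected in this sketch.
    warnings.simplefilter("ignore")

    # str input: the text is already decoded, so nothing is detected.
    soup_from_str = BeautifulSoup(xml_text, "html.parser", from_encoding="utf-8")

    # bytes input: Beautiful Soup does the decoding and records the encoding.
    soup_from_bytes = BeautifulSoup(xml_bytes, "html.parser", from_encoding="utf-8")

print(soup_from_str.original_encoding)
print(soup_from_bytes.original_encoding)
```

If that reading is right, original_encoding can only be populated when the markup arrives as bytes, for example from a file opened with mode "rb".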

Edit 3:

I put my XML contents into a string

xmlstring = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"

and used bs4.BeautifulSoup(xmlstring,"lxml-xml"), and now I'm getting the correct output:

<?xml version="1.0" encoding="utf-8"?>
<root>
 <info name="愛よ">
  ÜÜÜÜÜÜÜ
 </info>
 <items>
  <item thing="ÖöÖö">
   "23Äßßß"
  </item>
 </items>
</root>

So it seems something is wrong with the file after all.

Found the error: I have to specify the encoding when opening the file:

with open("test.xml",encoding='UTF-8') as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")

As I'm on Python 3, I thought the value of encoding was UTF-8 by default, but it turned out that it's system-dependent, and on my system it's cp1252.
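The effect can be reproduced with the standard library alone. This is a minimal sketch under a few assumptions: the sample string is one line taken from the test file, the temporary file stands in for test.xml, and cp1252 is passed explicitly to mimic the Windows default rather than relying on the machine's locale.

```python
import locale
import os
import tempfile

sample = '<item thing="ÖöÖö">"23Äßßß"</item>'

# The encoding open() falls back to in text mode when none is given
# (unless Python's UTF-8 mode is enabled); this is system-dependent.
print("locale default:", locale.getpreferredencoding(False))

# Write the sample as UTF-8 bytes, the way the editor saved the file.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Decoding the UTF-8 bytes as cp1252 yields the garbled text.
    with open(path, encoding="cp1252") as ff:
        garbled = ff.read()
    # Decoding with the encoding the file was saved in restores it.
    with open(path, encoding="utf-8") as ff:
        correct = ff.read()
finally:
    os.remove(path)

print(garbled)
print(correct)
```

The garbled line shows the same Ã–Ã¶Ã–Ã¶ pattern as the Beautiful Soup output above, which suggests the text was already mis-decoded before the parser ever saw it.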
