Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console
I am trying to read an XML document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:
    <?xml version="1.0" encoding="utf-8"?>
    <root>
        <info name="愛よ">ÜÜÜÜÜÜÜ</info>
        <items>
            <item thing="ÖöÖö">"23Äßßß"</item>
        </items>
    </root>

First I check the XML using ElementTree:
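    import xml.etree.ElementTree as ET

    def printxml(xml, indent=''):
        # Print the tag and its text (empty if the element has no text)
        print(indent + str(xml.tag) + ': ' + (xml.text if xml.text is not None else '').replace('\n', ''))
        # Print each attribute one level deeper
        if len(xml.attrib) > 0:
            for k, v in xml.attrib.items():
                print(indent + '\t' + k + ' - ' + v)
        # Recurse into child elements; iterating the element directly
        # replaces getchildren(), which was removed in Python 3.9
        for child in xml:
            printxml(child, indent + '\t')

    xml0 = ET.parse("test.xml").getroot()
    printxml(xml0)

The output is correct: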
    root:
        info: ÜÜÜÜÜÜÜ
            name - 愛よ
        items:
            item: "23Äßßß"
                thing - ÖöÖö

Now I read the same file with Beautiful Soup and pretty-print it:
    import bs4

    with open("test.xml") as ff:
        xml = bs4.BeautifulSoup(ff, "html5lib")
    print(xml.prettify())

Output:
    <!--?xml version="1.0" encoding="utf-8"?-->
    <html>
     <head>
     </head>
     <body>
      <root>
       <info name="愛よ">
        ÜÜÜÜÜÜÜ
       </info>
       <items>
        <item thing="ÖöÖö">
         "23Äßßß"
        </item>
       </items>
      </root>
     </body>
    </html>

This is wrong. Making the call with an explicit encoding specified, bs4.BeautifulSoup(ff, "html5lib", from_encoding="utf-8"), doesn't change the result.
Doing

    print(xml.original_encoding)

outputs

    None

So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the header information says UTF-8 as well, and I have chardet installed as the doc recommends.
Am I making a mistake here? What could be causing this?
EDIT: When I invoke the code without html5lib I get this warning:
    UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    The code that caused this warning is on line 241 of the file c:\users\my.name\appdata\local\continuum\anaconda2\envs\python3\lib\site-packages\spyder\utils\ipython\start_kernel.py. To get rid of this warning, change code that looks like this:

     BeautifulSoup(YOUR_MARKUP})

    to this:

     BeautifulSoup(YOUR_MARKUP, "html5lib")

      markup_type=markup_type))

EDIT 2:
As suggested in a comment, I tried bs4.BeautifulSoup(ff, "html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff, "lxml-xml"); still the same output.
What also strikes me as odd is that even when specifying an encoding, as in bs4.BeautifulSoup(ff, "lxml-xml", from_encoding='utf-8'), the value of xml.original_encoding is None, contrary to what is written in the doc.
EDIT 3:
I put my XML contents into a string
    xmlstring = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"

and used bs4.BeautifulSoup(xmlstring, "lxml-xml"). Now I'm getting the correct output:
    <?xml version="1.0" encoding="utf-8"?>
    <root>
     <info name="愛よ">
      ÜÜÜÜÜÜÜ
     </info>
     <items>
      <item thing="ÖöÖö">
       "23Äßßß"
      </item>
     </items>
    </root>

So it seems something is wrong with the file after all.
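Since the same markup parses fine from a string, one way to narrow this down is to compare the file's raw bytes with what a text-mode read returns. The following is only a minimal sketch under stated assumptions: the sample markup is shortened, the temporary file is a hypothetical stand-in for test.xml, and cp1252 is forced explicitly to mimic what a Windows default might be.

```python
import os
import tempfile

# Shortened sample markup (illustrative, not the full test file)
sample = '<?xml version="1.0" encoding="utf-8"?><root><info name="愛よ">ÜÜÜ</info></root>'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for test.xml, written as UTF-8 bytes
    path = os.path.join(tmp, "test.xml")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))

    # Binary read: the raw bytes decode cleanly as UTF-8
    with open(path, "rb") as f:
        from_bytes = f.read().decode("utf-8")

    # Text-mode read with cp1252 (assumed here to imitate the system default):
    # every UTF-8 multi-byte sequence turns into two or three wrong characters
    with open(path, encoding="cp1252") as f:
        as_cp1252 = f.read()

print("bytes decoded as UTF-8 match the original:", from_bytes == sample)
print("text read as cp1252 matches the original:", as_cp1252 == sample)
```

If the first comparison holds and the second does not, the file content is fine and the damage happens while the file is being decoded on read.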
Found the error: I have to specify the encoding when opening the file:
    with open("test.xml", encoding='utf-8') as ff:
        xml = bs4.BeautifulSoup(ff, "html5lib")

As I'm on Python 3, I thought the value of encoding was UTF-8 by default, but it turned out that it's system-dependent, and on my system it's cp1252.
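To see which encoding a plain open() call would pick on a given machine, something like the following sketch may help. It assumes the Python 3 behaviour documented for open(), where the default comes from locale.getpreferredencoding(False). The file name sample.txt and the test string are made up for illustration.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when no encoding argument is given;
# often 'cp1252' on Windows and 'UTF-8' on Linux or macOS
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

text = 'ÜÜÜ "23Äßßß"'  # illustrative sample, not the full test file

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    # Write and read with an explicit encoding so the result does not
    # depend on the system default
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        roundtrip = f.read()

print("explicit UTF-8 round trip is lossless:", roundtrip == text)
```

Passing encoding explicitly keeps the result the same across systems. Opening the file in binary mode ("rb") should be another option, since Beautiful Soup then receives bytes rather than an already decoded string, but I have not tested that here.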