class ICU::CharsetDetector
- ICU::CharsetDetector
- Reference
- Object
Overview
Charset detection
This class provides a facility for detecting the charset or encoding of character data in an unknown text format.
Usage
csdet = ICU::CharsetDetector.new
csm = csdet.detect("Sôme text")
csm.name # => "UTF-8"
csm.confidence # => 80
Defined in:
icu/charset_detector.cr
Constructors
Class Method Summary
- .detectable_charsets : Array(String)
  Returns the list of detectable charsets.
Instance Method Summary
- #detect(text : String) : CharsetMatch
  Returns the charset that best matches the supplied input data.
- #detect_all(text : String) : Array(CharsetMatch)
  Finds all charset matches that appear to be consistent with the input.
- #detectable_charsets : Array(String)
  Returns the list of detectable charsets.
- #finalize
- #to_unsafe : LibICU::UCharsetDetector
Constructor Detail
Class Method Detail
.detectable_charsets : Array(String)
Returns the list of detectable charsets.
Instance Method Detail
#detect(text : String) : CharsetMatch
Returns the charset that best matches the supplied input data.
csdet = ICU::CharsetDetector.new
csm = csdet.detect("Some text")
csm.name # => "ISO-8859-1"
csm.confidence # => 30
csm = csdet.detect("Sôme other text")
csm.name # => "UTF-8"
csm.confidence # => 80
FIXME not thread-safe
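Because #detect always returns its best guess, even at low confidence (30 for plain ASCII input in the example above), callers may want to reject weak matches. The following is a minimal sketch, not part of the library: the helper name `confident_charset`, the default threshold of 50, and the `require "icu"` path are assumptions made for illustration.

```crystal
require "icu" # assumed require path for the shard

# Hypothetical helper (not part of ICU::CharsetDetector): returns the
# detected charset name, or nil when the confidence is below the threshold.
def confident_charset(text : String, min_confidence = 50) : String?
  match = ICU::CharsetDetector.new.detect(text)
  match.confidence >= min_confidence ? match.name : nil
end

name = confident_charset("Sôme other text")
puts name || "no confident match"
```

The threshold is a judgment call; reported confidence values can differ between ICU versions, so it may need tuning against real input.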
#detect_all(text : String) : Array(CharsetMatch)
Finds all charset matches that appear to be consistent with the input. The results are ordered with the best-quality match first.
csdet = ICU::CharsetDetector.new
csms = csdet.detect_all("Some text")
csms.map { |csm| {name: csm.name, confidence: csm.confidence} }
# => [{name: "ISO-8859-1", confidence: 30},
#     {name: "ISO-8859-2", confidence: 30},
#     {name: "UTF-8", confidence: 15},
#     {name: "UTF-16BE", confidence: 10},
#     {name: "UTF-16LE", confidence: 10}]
FIXME not thread-safe
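Since #detect_all can return several low-confidence candidates, it can be useful to narrow the list before trying each encoding in turn. This is a sketch under assumptions: the cutoff of 20 is arbitrary and the `require "icu"` path is assumed; only methods shown on this page are used.

```crystal
require "icu" # assumed require path for the shard

csdet = ICU::CharsetDetector.new

# Keep only candidates at or above an illustrative cutoff of 20.
# Order is preserved, so the best match stays first.
candidates = csdet.detect_all("Some text")
  .select { |csm| csm.confidence >= 20 }
  .map(&.name)

puts candidates.join(", ")
```

With the sample results shown above, such a filter would keep the two ISO-8859 candidates, though actual output depends on the installed ICU version.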
#detectable_charsets : Array(String)
Returns the list of detectable charsets.
ICU::CharsetDetector.new.detectable_charsets
# => ["UTF-8",
#     "UTF-16BE",
#     "UTF-16LE",
#     ...]
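The same list is available without creating an instance, through the class method. A small sketch of checking whether a given charset can be detected at all (the `require "icu"` path is an assumption):

```crystal
require "icu" # assumed require path for the shard

supported = ICU::CharsetDetector.detectable_charsets

# "UTF-8" appears in the sample list above; other names depend on the ICU build.
puts "UTF-8 detection available" if supported.includes?("UTF-8")
```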