AVRO supports the following primitive types:

- null: no value
- boolean: a binary value
- int: 32-bit signed integer
- long: 64-bit signed integer
- float: single-precision (32-bit) IEEE 754 floating-point number
- double: double-precision (64-bit) IEEE 754 floating-point number
- bytes: sequence of 8-bit unsigned bytes
- string: Unicode character sequence

More details can be found on AVRO's Apache documentation page. If you intend to create a column with TINYINT or SMALLINT for an AVRO table, you will get an "Undefined name: TINYINT" error in CDH 5.4.x, or it will fall back to the INT data type automatically in CDH 5.5.x. DECIMAL in AVRO is currently supported as a logical type rather than a primitive type; see the documentation page again for more details. You will need to define the DECIMAL type in the following way in the schema file:
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

There are performance differences between DECIMAL and FLOAT/DOUBLE that we need to be aware of. FLOAT and DOUBLE take up a fixed number of bytes in AVRO (4 or 8), and their operations are supported in hardware, which makes them fast. DECIMAL is different: it avoids floating-point representation errors, which makes it the only choice for some use cases, such as financial calculations. For those cases, the performance cost usually matters less than the added accuracy. In general, prefer FLOAT/DOUBLE over DECIMAL unless you need to avoid representation error. For example, FLOAT is fine for storing a student's current GPA (it can always be recalculated from grades), but it is not okay for the amount of money in a bank account (which should never be inaccurate). The downside of using DECIMAL in AVRO is that schema evolution is currently not tested: once you have written DECIMAL data into an AVRO table, you should not change the underlying representation or the data type of the column, as this might lead to data corruption or other issues not known at this stage.
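To make the accuracy point concrete, here is a minimal Python sketch (not tied to AVRO itself) of the representation error that FLOAT/DOUBLE carry and DECIMAL avoids:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so small
# errors creep into simple sums -- unacceptable for money.
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004

# Exact base-10 arithmetic keeps financial sums precise.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```

The same trade-off applies regardless of storage format: the error is inherent to IEEE 754 binary floating point, not to AVRO.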
Hi Eric,
After I defined the decimal column as below in the avsc schema file, in Hive I see the data type of the decimal column as BYTE. Any idea on this?
{
  "type": "record",
  "name": "test_avro",
  "fields": [
    {"name": "abdc", "type": ["null", "string"]},
    {"name": "adfaf", "type": ["null", "string"]},
    {
      "name": "adfad",
      "type": "bytes",
      "logicalType": "decimal",
      "precision": 4,
      "scale": 2
    }
  ]
}
Correction: instead of BYTE it's BINARY in the line below. Typo on my part.
"In Hive I see the data type of the decimal column as BINARY."
Hi Atul,
Firstly, thanks for visiting my blog.
The "binary" data type showing in Hive is expected; please see the AVRO official documentation:
https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
Have a look at the section under “Avro to Hive type conversion”, and you will see that “bytes” in AVRO is mapped to “binary” in Hive.
Cheers
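As a footnote to the reply above: the reason the physical type is "bytes" is that, per the Avro specification, a decimal logical value is stored as the two's-complement, big-endian bytes of its unscaled integer value. A small Python sketch of that encoding (`encode_decimal` is a hypothetical helper name, not an Avro library API):

```python
from decimal import Decimal

def encode_decimal(value: Decimal, scale: int) -> bytes:
    # Avro stores a decimal logical type as the two's-complement,
    # big-endian byte representation of the unscaled integer
    # (value * 10**scale).
    unscaled = int(value.scaleb(scale))
    # Reserve enough bytes for the magnitude plus a sign bit.
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)

print(encode_decimal(Decimal("12.34"), 2).hex())  # unscaled 1234 -> 04d2
```

Hive's AvroSerDe surfaces the physical type, which is why the column shows up as binary.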