Avro Data Types

This article explains the Data Types supported by Avro. Avro currently supports the following Primitive Types:
null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit) IEEE 754 floating-point number
double: double precision (64-bit) IEEE 754 floating-point number
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence
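
For example, a minimal record schema using several of these Primitive Types might look like the following sketch (the record and field names are made up purely for illustration):

{
  "type": "record",
  "name": "example_record",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": ["null", "string"]},
    {"name": "active", "type": "boolean"},
    {"name": "score", "type": "double"}
  ]
}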
More details can be found on AVRO’s Apache documentation page. If you intend to create a column with TINYINT or SMALLINT for an AVRO table, you will get an Undefined name: “TINYINT” error in CDH 5.4.x, or it will fall back to the INT data type automatically in CDH 5.5.x. DECIMAL in AVRO is currently supported as a Logical Type rather than a Primitive Type; see the Doc Page again for more details. You will need to define the DECIMAL type in the following way in the schema file:
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}
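
For reference, in a full .avsc schema file this DECIMAL definition would normally be nested under a field’s "type" attribute inside a record, roughly as sketched below (the record and field names here are only illustrative):

{
  "type": "record",
  "name": "payment",
  "fields": [
    {
      "name": "amount",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 4,
        "scale": 2
      }
    }
  ]
}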
There are performance differences between DECIMAL and FLOAT/DOUBLE that we need to be aware of. FLOAT and DOUBLE take up a fixed number of bytes in Avro (4 or 8 bytes), and their operations are supported in hardware, which makes them fast. DECIMAL is different: it avoids floating-point representation errors, which makes it the only choice for some use cases, such as financial calculations, where the added accuracy usually matters more than the performance cost. In general, prefer FLOAT/DOUBLE over DECIMAL unless you need to avoid that error. For example, FLOAT is fine for storing a student’s current GPA (it can always be recalculated from the grades), but it is not acceptable for the amount of money in a bank account (which should never be inaccurate).

The downside of using DECIMAL in AVRO is that schema evolution for it is currently not tested, so once you have written DECIMAL data into an AVRO table, you should not change the underlying representation or the data type of the column; doing so might lead to data corruption or other issues that are not known at this stage.
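
To put the GPA versus bank balance example into schema form, a sketch along the following lines (the field names and the precision/scale values are arbitrary choices for illustration) would store the GPA as a plain DOUBLE and the account balance as a DECIMAL:

{
  "type": "record",
  "name": "student_account",
  "fields": [
    {"name": "gpa", "type": "double"},
    {
      "name": "balance",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
      }
    }
  ]
}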

Comments

  1. Atul

    Hi Eric,

    After I defined the decimal column as below in the avsc schema file, in Hive I see the data type of the decimal column as BYTE. Any idea on this?

    {
      "type": "record",
      "name": "test_avro",
      "fields": [
        {"name": "abdc", "type": ["null", "string"]},
        {"name": "adfaf", "type": ["null", "string"]},
        {
          "name": "adfad",
          "type": "bytes",
          "logicalType": "decimal",
          "precision": 4,
          "scale": 2
        }
      ]
    }

