To use the pandemic as an example, the traditional method assumes all members of the population are mixing at once, meaning every infected individual has an equal chance of infecting anyone else in the population or sub-population they belong to. In truth, most of us only have daily contact with our family, co-workers and people on public transport, so the rest of the population can be ignored because we have no direct connection with them. This produces graphs in which each individual is connected to only a few others.
When examining data in a graph, we can look at the degree of a node: the number of neighbouring nodes it is connected to. The more connections a node has, the higher its degree and the more susceptible it is to influence from other nodes. We can also examine a node's betweenness, which is how many shortest paths pass through it. Another feature is the number of triangles a node is part of, where three nodes are all connected to one another. Finally, there is transitivity, which shows how many of a node's neighbours are also neighbours of each other. If you were to use a graph database to track the spread of a virus, these features could be combined with the number of cases in each area and its nearby areas in a simple linear model to estimate future cases.
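As a minimal sketch of the features above, the widely used networkx library (assumed available; the contact network here is invented for illustration) computes degree, betweenness, triangle counts and per-node clustering on a small graph:

```python
import networkx as nx

# Hypothetical contact network: an edge means two people interacted.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
])

degree = dict(G.degree())                   # number of direct contacts
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through each node
triangles = nx.triangles(G)                 # triangles each node participates in
clustering = nx.clustering(G)               # fraction of a node's neighbours that are also neighbours

print(degree["carol"], triangles["carol"])  # carol has 3 contacts and sits in 1 triangle
```

Note that `nx.clustering` gives the per-node measure described above, while `nx.transitivity(G)` returns the graph-wide equivalent.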
Graphs can also be used to suggest recommendations. Previously, Collaborative Filtering and Content-Based Filtering were the main methods of recommending products. Collaborative filtering makes recommendations based on what similar users like, while content-based filtering searches for products with similar attributes, e.g. a user might be recommended Interstellar because it shares some genres with Star Wars. However, once you start to consider relationships beyond these primary ones, the need for graphs becomes apparent.
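The content-based case can be sketched in a few lines. The catalogue and genre sets below are invented, and Jaccard similarity is just one simple choice of attribute-overlap measure:

```python
# Hypothetical catalogue: each title mapped to its set of genres.
movies = {
    "Star Wars": {"sci-fi", "adventure", "action"},
    "Interstellar": {"sci-fi", "adventure", "drama"},
    "Notting Hill": {"romance", "comedy"},
}

def jaccard(a, b):
    # Overlap of two attribute sets: |intersection| / |union|.
    return len(a & b) / len(a | b)

liked = "Star Wars"
scores = {title: jaccard(movies[liked], genres)
          for title, genres in movies.items() if title != liked}
best = max(scores, key=scores.get)
print(best)  # Interstellar: it shares two of four combined genres
```

Collaborative filtering works the same way, except the similarity is computed between users' rating histories rather than between product attributes.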
A knowledge graph shows how many ways you can make a recommendation: if a user likes the movie "Cloud Atlas", they may also like "Catch Me If You Can" because Tom Hanks stars in both; alternatively, if they are looking to buy a book, we can recommend the novel "Cloud Atlas". Although these suggestions are technically possible in a traditional SQL database, the queries become increasingly complicated and extremely slow. Graphs allow you to focus only on the nodes with connections, ignoring the irrelevant observations and saving significant resources.
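A toy version of that traversal can be sketched with networkx (assumed available; the nodes and relationship labels are invented for illustration, and a real deployment would express this as a Cypher query in Neo4j):

```python
import networkx as nx

# Tiny hypothetical knowledge graph mixing movies, people and books.
kg = nx.Graph()
kg.add_edge("Cloud Atlas (film)", "Tom Hanks", rel="ACTED_IN")
kg.add_edge("Catch Me If You Can", "Tom Hanks", rel="ACTED_IN")
kg.add_edge("Cloud Atlas (film)", "Cloud Atlas (book)", rel="BASED_ON")

def related(node):
    # Walk two hops out: anything sharing a neighbour with the start node.
    # Only connected nodes are ever visited; the rest of the catalogue
    # is never touched, which is the efficiency win over SQL joins.
    return {m for n in kg[node] for m in kg[n] if m != node}

print(related("Cloud Atlas (film)"))  # {'Catch Me If You Can'}
```

The book recommendation is the direct BASED_ON neighbour, while the co-star recommendation comes from the two-hop walk through Tom Hanks.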
Another option for running graph algorithms is GraphX in Apache Spark. It is not graph-native: it lets you run the algorithms in a distributed fashion, but its data storage is not optimised for graphs. However, data can be transferred between Neo4j and GraphX if you want to use both systems, with Neo4j for storage and GraphX for computation.
While this is a very effective strategy, maintaining its benefits requires freeing up human resources through Artificial Intelligence (AI) and Machine Learning (ML) automation.
AI/ML-based technologies not only process data for easier human consumption; they can also learn and adapt from the data they process, taking automation further by constantly testing what works in order to make smarter assumptions in future experiments.
Using a real-time ML engine to run graph data science algorithms means that millions of nodes and relationships can be analysed quickly and quietly in the background, allowing employees to focus on development, creativity and future strategy while significantly reducing costs.